Create TensorFlow Task
Updated at: 2025-10-27
You can create a new task specifically of the TensorFlow type.
Prerequisites
- You have installed the CCE AI Job Scheduler and CCE Deep Learning Frameworks Operator components; without them, the cloud-native AI features are unavailable.
- An IAM user can create tasks in a queue only if the user is among the users linked to that queue.
- Installing the CCE Deep Learning Frameworks Operator component will also install the TensorFlow deep learning framework.
Operation steps
- Sign in to the Baidu AI Cloud official website and enter the management console.
- Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
- Click Cluster Management - Cluster List in the left navigation pane.
- Click on the target cluster name in the Cluster List page to navigate to the cluster management page.
- On the Cluster Management page, click Cloud-Native AI - Task Management.
- Click Create Task on the Task Management page.
- On the Create Task page, configure basic task information:

- Task name: Specify a custom task name using lowercase letters, numbers, “-”, or “.”. The name must start and end with a lowercase letter or number and be 1-65 characters long.
- Namespace: Choose the namespace for the new task.
- Queue: Choose the corresponding queue for the new task.
- Task priority: Set the priority level for the task.
- Allow overcommitment: Enable this option to let the task obtain overcommitted resources through task preemption. This requires the CCE AI Job Scheduler component at version 1.4.0 or later.
- Delay tolerance: When enabled, the system preferentially schedules tasks or workloads onto fragmented cluster resources to improve resource utilization, which may affect business latency. A sketch of how these basic settings map onto a TFJob manifest follows this list.
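For reference, the basic settings above surface in the task's underlying TFJob manifest roughly as follows. This is a minimal sketch with placeholder names (task my-tf-task, queue q-train, PriorityClass high-priority); the queue annotation shown is a common Volcano convention, and the console fills these fields in for you.
YAML
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "my-tf-task"          # custom task name
  namespace: "default"        # namespace selected for the task
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            scheduling.volcano.sh/queue-name: "q-train"   # queue the task is submitted to
        spec:
          schedulerName: volcano
          priorityClassName: high-priority                # task priority
          containers:
            - name: tensorflow
              image: registry.baidubce.com/cce-public/kubeflow/tf-dist-mnist-test:1.0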
- Configure basic code information:

- Code Configuration Type: Choose the method for code configuration, currently supporting "BOS File," "Local File Upload," "Git Code Repository," and "Not Configured Temporarily."
- Execution command: Define the command used to launch the code (see the sketch below).
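In the manifest, the execution command becomes the container's startup command. A minimal sketch, assuming a hypothetical training script at /workspace/train.py:
YAML
containers:
  - name: tensorflow
    image: registry.baidubce.com/cce-public/kubeflow/tf-dist-mnist-test:1.0
    command: ["/bin/sh", "-c"]
    args: ["python /workspace/train.py --epochs 10"]   # hypothetical script and arguments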
- Configure data-related information:

- Set data source: Currently supports datasets, persistent volume claims, temporary paths, and host paths. For datasets, all available datasets are listed, and selecting one automatically matches the PVC with the same name. For persistent volume claims, choose the claim directly. A typical mount is sketched below.
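A selected dataset or persistent volume claim is mounted into the training containers through standard Kubernetes volumes. A minimal sketch, assuming a hypothetical PVC named mnist-data mounted at /data:
YAML
spec:
  containers:
    - name: tensorflow
      volumeMounts:
        - name: training-data
          mountPath: /data            # path the training code reads from
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: mnist-data         # PVC matched to the selected dataset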
- Click "Next" to proceed to container-related configurations.
- Configure task type information:

- Select framework: Choose TensorFlow.
- Training method: Select either Single-Machine or Distributed training.
- Select Role: For "Single-machine" training, only "Worker" can be chosen. For "Distributed" training, additional roles such as "PS," "Chief," and "Evaluator" can be selected.
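Each selected role becomes a key under spec.tfReplicaSpecs in the manifest. A sketch of a distributed layout with hypothetical replica counts (pod templates omitted for brevity):
YAML
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1     # coordinates the training job
    PS:
      replicas: 2     # parameter servers
    Worker:
      replicas: 4     # execute the training steps
    Evaluator:
      replicas: 1     # runs periodic evaluation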
- Configure pod information (advanced settings are optional).

- Specify the desired number of pods for the role.
- Define the restart policy for the pod. Options: “Restart on Failure” or “Never Restart”.
- Provide the address for pulling the container image. Alternatively, click Select Image to choose the desired image.
- Enter the image version. If left unspecified, the latest version will be used by default.
- Container Quota: Specify the CPU, memory, and GPU/NPU resource allocation for the container.
- Environment Variables: Enter the variable names and their corresponding values.
- Lifecycle: Includes start commands, parameters, actions after startup, and actions before stopping, all of which can be customized as needed.
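These pod settings correspond to fields on each replica spec and its container. A sketch with placeholder values; EPOCHS is a hypothetical environment variable:
YAML
Worker:
  replicas: 2
  restartPolicy: OnFailure        # “Restart on Failure”; use Never for “Never Restart”
  template:
    spec:
      containers:
        - name: tensorflow
          image: registry.baidubce.com/cce-public/kubeflow/tf-dist-mnist-test:1.0   # image address and version; :latest is used if no tag is given
          env:
            - name: EPOCHS        # hypothetical variable
              value: "10"
          resources:
            requests:
              cpu: 2
              memory: 4Gi
            limits:
              cpu: 2
              memory: 4Gi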
- Configure the advanced task settings.

- Add credentials to access the private image registry if using a private image.
- Tensorboard: If task visualization is required, enable the Tensorboard function. After enabling it, specify the “Service Type” and “Training Log Reading Path”.
- Assign K8s labels to the task.
- Provide annotations for the task.
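In the manifest, these advanced options map to standard Kubernetes fields. A sketch assuming a hypothetical image pull Secret named my-registry-secret and a hypothetical private image:
YAML
metadata:
  labels:
    app: tf-train                 # custom K8s label on the task
  annotations:
    owner: ml-team                # custom annotation on the task
spec:
  tfReplicaSpecs:
    Worker:
      template:
        spec:
          imagePullSecrets:
            - name: my-registry-secret   # credentials for the private image registry
          containers:
            - name: tensorflow
              image: registry.baidubce.com/private/my-train:1.0   # hypothetical private image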
- Click the Finish button to finalize task creation.
Example of creating a task with YAML
YAML
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tfjob-dist-mnist-for-e2e-test"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            # if your libcuda.so.1 is in a custom path, set the correct path with the following annotation
            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
        spec:
          schedulerName: volcano
          containers:
            - name: tensorflow
              image: registry.baidubce.com/cce-public/kubeflow/tf-dist-mnist-test:1.0
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for gpu core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 10
                  baidu.com/v100_32g_cgpu_memory: "2"
              # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
              # `dist_mnist.py` needs to be replaced with the name of your gpu process.
              lifecycle:
                preStop:
                  exec:
                    command: [
                      "/bin/sh", "-c",
                      "kill -10 `ps -ef | grep dist_mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"
                    ]
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            # if your libcuda.so.1 is in a custom path, set the correct path with the following annotation
            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
        spec:
          schedulerName: volcano
          containers:
            - name: tensorflow
              image: registry.baidubce.com/cce-public/kubeflow/tf-dist-mnist-test:1.0
              env:
                # for gpu memory over request, set 0 to disable
                - name: CGPU_MEM_ALLOCATOR_TYPE
                  value: "1"
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for gpu core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 20
                  baidu.com/v100_32g_cgpu_memory: "4"
              # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
              # `dist_mnist.py` needs to be replaced with the name of your gpu process.
              lifecycle:
                preStop:
                  exec:
                    command: [
                      "/bin/sh", "-c",
                      "kill -10 `ps -ef | grep dist_mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"
                    ]
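To create the task from this manifest, save it to a file (for example tfjob.yaml) and run kubectl apply -f tfjob.yaml against the cluster; the task's status can then be checked with kubectl get tfjob -n <namespace>.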
