Create AI Training Task
Updated at: 2025-10-27
Create a new task of the AITraining type. An AITrainingJob is an optimized AI training task that enhances computing efficiency, resource utilization, fault tolerance, and flexible scheduling.
Prerequisites
- You have successfully installed the CCE AI Job Scheduler and CCE Deep Learning Frameworks Operator components; without these, the cloud-native AI features will be unavailable.
- As an IAM user, you can create tasks in a queue only if you are one of the users associated with that queue.
- When you install the CCE Deep Learning Frameworks Operator component, the PyTorch deep learning framework is installed automatically.
Operation steps
- Sign in to the Baidu AI Cloud official website and enter the management console.
- Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
- Click Cluster Management - Cluster List in the left navigation pane.
- Click on the target cluster name in the Cluster List page to navigate to the cluster management page.
- On the Cluster Management page, click Cloud-Native AI - Task Management.
- Click Create Task on the Task Management page.
- On the Create Task page, configure basic task information:

- Task name: supports uppercase and lowercase letters, numbers, and the special characters (-_/); it must start with a letter or Chinese character and be 1 to 65 characters long.
- Namespace: Choose the namespace for the new task.
- Select queue: Choose the queue associated with the new task.
- Task priority: Set the priority level for the task.
- Allow overcommitment: Enable this option to use task preemption for overcommitment. The CCE AI Job Scheduler component must be installed and updated to version 1.4.0 or higher.
- Tolerance for delay: The system will prioritize scheduling tasks or workloads to fragmented cluster resources to enhance cluster resource utilization, though this might impact business latency performance.
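
For reference, the basic information above corresponds roughly to the top of an AITrainingJob manifest. This is a minimal sketch: the name, namespace, and priority fields come from the full YAML example later on this page, while how the selected queue is recorded in the manifest is not shown in that example and is therefore omitted here.

```yaml
# Sketch of the basic-information fields in an AITrainingJob manifest
apiVersion: kongming.cce.baidubce.com/v1
kind: AITrainingJob
metadata:
  name: my-training-job   # must follow the naming rules above (illustrative name)
  namespace: default      # the selected namespace
spec:
  priority: normal        # the selected task priority
```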
- Configure basic code information:

- Code configuration type: Specify the code configuration method. Current options include “BOS File,” “Local File Upload,” and “Not Configured Temporarily.”
- Execution command: Define the command to execute the code.
- Configure data-related information:

- Set data source: currently supports datasets, persistent volume claims (PVCs), temporary paths, and host paths. For datasets, all available datasets are listed; selecting a dataset automatically matches the PVC with the same name. For persistent volume claims, choose the PVC directly.
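
The four data-source types map onto standard Kubernetes volume types in the generated pod template. The sketch below shows one plausible mapping; the volume names and paths are illustrative, and the exact structure the console generates may differ.

```yaml
# Sketch: data sources expressed as Kubernetes volumes (names/paths illustrative)
volumes:
- name: dataset-volume              # dataset or persistent volume claim
  persistentVolumeClaim:
    claimName: my-dataset-pvc       # for a dataset, the PVC with the same name
- name: scratch-volume              # temporary path
  emptyDir: {}
- name: host-volume                 # host path
  hostPath:
    path: /data/train
```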
- Click "Next" to proceed to container-related configurations.
- Configure task type information:

- Select framework: choose “AITrainingJob” and specify the training framework. Currently, “horovod” and “paddle” are supported.
- Training method: Select either Single-Machine or Distributed training.
- Select role: For the "Single-machine" training method, only the "Trainer" role is available. For the "Distributed" method, both "Launcher" and "Trainer" roles are available, and the elastic range of pods must be defined.
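
These role and elasticity choices correspond to the `replicaSpecs` section of the YAML example later on this page, roughly as follows (replica counts are illustrative):

```yaml
# Sketch: roles and elastic range in an AITrainingJob spec
replicaSpecs:
  launcher:              # "Launcher" role (distributed training only)
    replicaType: master
    replicas: 1
  trainer:               # "Trainer" role
    replicaType: worker
    replicas: 3
    minReplicas: 1       # lower bound of the elastic pod range
    maxReplicas: 5       # upper bound of the elastic pod range
```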
- Configure pod information (advanced settings are optional).

- Specify the desired number of pods.
- Define the restart policy for the pod. Options: “Restart on Failure” or “Never Restart”.
- Provide the address for pulling the container image. Alternatively, click Select Image to choose the desired image.
- Enter the image version. If left unspecified, the latest version will be used by default.
- Set the CPU, memory, and GPU resource requirements for the container.
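
Taken together, the pod settings above land in a replica spec like the sketch below. The resource quantities are illustrative; the GPU resource name is taken from the full example on this page, and how the console's “Restart on Failure”/“Never Restart” options map onto the CRD's restart policy values is an assumption.

```yaml
# Sketch: pod configuration for the trainer role (quantities illustrative)
trainer:
  replicas: 3                      # desired number of pods
  restartPolicy: Never             # console option "Never Restart" (mapping assumed)
  template:
    spec:
      containers:
      - name: trainer
        # image address plus version tag; omit the tag to use the latest version
        image: registry.baidubce.com/cce-plugin-dev/horovod:v0.1.0
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            baidu/gpu_p40_8: "1"   # GPU resource name from the full example
          requests:
            cpu: "4"
            memory: 8Gi
            baidu/gpu_p40_8: "1"
```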
- Configure the advanced task settings.

- Set the maximum allowable training duration (leave blank for unlimited duration).
- Add credentials to access the private image registry if using a private image.
- TensorBoard: If task visualization is required, enable the TensorBoard function. After enabling it, specify the “Service Type” and “Training Log Reading Path”.
- Assign K8s labels to the task.
- Provide annotations for the task.
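
Labels, annotations, and private-registry credentials use standard Kubernetes fields, sketched below with illustrative values. The manifest fields for the maximum training duration and the TensorBoard settings are not shown in the example on this page, so they are omitted here.

```yaml
# Sketch: advanced settings expressed in the manifest (values illustrative)
metadata:
  labels:
    team: vision                     # K8s labels assigned to the task
  annotations:
    owner: alice                     # annotations for the task
spec:
  replicaSpecs:
    trainer:
      template:
        spec:
          imagePullSecrets:
          - name: my-registry-secret # credential for a private image registry
```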
- Click the Finish button to finalize task creation.
Example of creating a task with YAML
```yaml
apiVersion: kongming.cce.baidubce.com/v1
kind: AITrainingJob
metadata:
  name: job-horovod-test
  namespace: default
spec:
  # Cleanup policy for pods when the task ends: "All" cleans up all pods, "None" cleans up none
  cleanPodPolicy: All
  # Completion policy: "All" means the task completes when all pods complete,
  # "Any" means the task completes when any pod completes
  completePolicy: Any
  # Failure policy: "All" means the task fails only when all pods fail,
  # "Any" means the task fails when any pod fails
  failPolicy: Any
  # Supported frameworks: horovod and paddle
  frameworkType: horovod
  # Fault tolerance / elasticity switch: "true" enables it, "false" disables it.
  # When enabled, the trainer's fault tolerance policy must also be configured
  faultTolerant: true
  plugin:
    ssh:
    - ""
    discovery:
    - ""
  priority: normal
  replicaSpecs:
    launcher:
      completePolicy: Any
      failPolicy: Any
      maxReplicas: 1
      minReplicas: 1
      replicaType: master
      replicas: 1
      restartLimit: 100
      restartPolicy: OnNodeFailWithExitCode
      restartTimeLimit: 60
      restartTimeout: 864000
      template:
        metadata:
          creationTimestamp: null
        spec:
          initContainers:
          - args:
            - --barrier_roles=trainer
            - --incluster
            - --name=$(TRAININGJOB_NAME)
            - --namespace=$(TRAININGJOB_NAMESPACE)
            image: registry.baidubce.com/cce-plugin-dev/jobbarrier:v0.9
            imagePullPolicy: IfNotPresent
            name: job-barrier
            resources:
              limits:
                cpu: "1"
                memory: 1Gi
              requests:
                cpu: "1"
                memory: 1Gi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            securityContext: {}
          containers:
          - command:
            - /bin/bash
            - -c
            - horovodrun -np 3 --min-np=1 --max-np=5 --verbose --log-level=DEBUG --host-discovery-script /etc/edl/discover_hosts.sh python /horovod/examples/elastic/pytorch/pytorch_synthetic_benchmark_elastic.py
            env: []
            image: registry.baidubce.com/cce-plugin-dev/horovod:v0.1.0
            imagePullPolicy: Always
            name: aitj-0
            resources: {}
            securityContext:
              capabilities:
                add:
                - SYS_ADMIN
            volumeMounts:
            - mountPath: /dev/shm
              name: cache-volume
          dnsPolicy: ClusterFirstWithHostNet
          restartPolicy: Never
          schedulerName: volcano
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 1449Gi
            name: cache-volume
    trainer:
      completePolicy: None
      failPolicy: None
      # Fault tolerance configuration: the controller uses these rules to decide
      # whether a failed pod should be restarted
      faultTolerantPolicy:
      # Program exit codes that trigger a restart
      - exitCodes: "129,10001,127,137,143"
        restartPolicy: ExitCode
        restartScope: Pod
      # Cluster exception events that trigger a restart
      - exceptionalEvent: "nodeNotReady,PodForceDeleted"
        restartPolicy: OnNodeFail
        restartScope: Pod
      # Maximum number of replicas when elasticity is enabled
      maxReplicas: 5
      # Minimum number of replicas when elasticity is enabled
      minReplicas: 1
      replicaType: worker
      replicas: 3
      restartLimit: 100
      restartPolicy: OnNodeFailWithExitCode
      restartTimeLimit: 60
      restartTimeout: 864000
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - /usr/sbin/sshd && sleep 40000
            image: registry.baidubce.com/cce-plugin-dev/horovod:v0.1.0
            imagePullPolicy: Always
            name: aitj-0
            resources:
              # Limits and requests must be consistent
              limits:
                baidu/gpu_p40_8: "1"
              requests:
                baidu/gpu_p40_8: "1"
            securityContext:
              capabilities:
                add:
                - SYS_ADMIN
            volumeMounts:
            - mountPath: /dev/shm
              name: cache-volume
          dnsPolicy: ClusterFirstWithHostNet
          schedulerName: volcano
          terminationGracePeriodSeconds: 300
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 1449Gi
            name: cache-volume
```
