Create PyTorch Task
Updated at: 2025-10-27
This document describes how to create a new task of the PyTorch type.
Prerequisites
- The CCE AI Job Scheduler and CCE Deep Learning Frameworks Operator components have been installed; without them, the cloud-native AI features are unavailable.
- An IAM user can create tasks in a queue only if the user has been associated with that queue.
- When you install the CCE Deep Learning Frameworks Operator component, the PyTorch deep learning framework is installed automatically.
Operation steps
- Sign in to the Baidu AI Cloud official website and enter the management console.
- Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
- Click Cluster Management - Cluster List in the left navigation pane.
- Click the target cluster name on the Cluster List page to navigate to the cluster management page.
- On the Cluster Management page, click Cloud-Native AI - Task Management.
- Click Create Task on the Task Management page.
- On the Create Task page, configure basic task information:

- Task name: Specify a custom name using lowercase letters, numbers, hyphens (-), and periods (.). The name must start and end with a lowercase letter or number and be 1-65 characters long.
- Namespace: Choose the namespace for the new task.
- Queue: Choose the corresponding queue for the new task.
- Task priority: Set the priority level for the task.
- Allow overcommitment: Enable this option to reclaim overcommitted resources through task preemption. This requires the CCE AI Job Scheduler component to be installed at version 1.4.0 or later.
- Tolerance for delay: The system will prioritize scheduling tasks or workloads to fragmented cluster resources to enhance cluster resource utilization, though this might impact business latency performance.
- Configure basic code information:

- Code Configuration Type: Choose the method for code configuration, currently supporting "BOS File," "Local File Upload," "Git Code Repository," and "Not Configured Temporarily."
- Execution command: Define the command to execute the code.
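As a sketch, the execution command configured here ends up as the container's command and args in the generated job spec; the script path and flags below are hypothetical placeholders:

YAML
              # hypothetical entrypoint and flags; replace with your own code
              command: ["python", "/workspace/train.py"]
              args: ["--epochs", "10", "--backend", "gloo"]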
- Configure data-related information:

- Set data source: Datasets, persistent volume claims, temporary paths, and host paths are currently supported. Dataset: all available datasets are listed, and selecting one automatically matches the PVC with the same name. Persistent volume claim: choose the persistent volume claim directly.
- Click "Next" to proceed to container-related configurations.
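For reference, a selected dataset or persistent volume claim surfaces in the replica's pod spec as a volume plus a mount; the volume name, mount path, and claim name below are hypothetical:

YAML
        spec:
          containers:
            - name: pytorch
              volumeMounts:
                - name: train-data        # hypothetical volume name
                  mountPath: /mnt/data    # hypothetical mount path inside the container
          volumes:
            - name: train-data
              persistentVolumeClaim:
                claimName: mnist-dataset  # PVC with the same name as the selected dataset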
- Configure task type information:

- Select Framework: Choose PyTorch.
- Training method: Select either Single-Machine or Distributed training.
- Select Role: For "Single-machine" training, only "Worker" can be chosen. For "Distributed" training, additional roles such as "PS," "Chief," and "Evaluator" can be selected.
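In the generated PyTorchJob, the training method and roles map to entries under pytorchReplicaSpecs; a minimal sketch with illustrative replica counts:

YAML
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1    # single coordinating replica
    Worker:
      replicas: 2    # increase for distributed training; single-machine training uses Worker only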
- Configure pod information (advanced settings are optional).

- Specify the desired number of pods.
- Define the restart policy for the pod. Options: “Restart on Failure” or “Never Restart”.
- Provide the address for pulling the container image. Alternatively, click Select Image to choose the desired image.
- Enter the image version. If left unspecified, the latest version will be used by default.
- Container Quota: Specify the CPU, memory, and GPU/NPU resource allocation for the container.
- Environment Variables: Enter the variable names and their corresponding values.
- Lifecycle: Includes start commands, parameters, actions after startup, and actions before stopping, all of which can be customized as needed.
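The pod settings above land in the replica's container spec; a hedged sketch, where the environment variable and lifecycle command are hypothetical examples:

YAML
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              env:
                - name: EPOCHS            # hypothetical environment variable
                  value: "10"
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
              lifecycle:
                postStart:                # hypothetical action after startup
                  exec:
                    command: ["/bin/sh", "-c", "echo started > /tmp/started"]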
- Configure the advanced task settings.

- Add credentials to access the private image registry if using a private image.
- TensorBoard: If task visualization is required, enable the TensorBoard function. After enabling it, specify the “Service Type” and “Training Log Reading Path”.
- Assign K8s labels to the task.
- Provide annotations for the task.
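These advanced settings correspond to standard Kubernetes fields on the replica's pod template; the label, annotation, and Secret name below are hypothetical:

YAML
      template:
        metadata:
          labels:
            app: mnist-train             # hypothetical K8s label
          annotations:
            owner: ml-team               # hypothetical annotation
        spec:
          imagePullSecrets:
            - name: my-registry-secret   # hypothetical Secret holding private-registry credentials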
- Click the Finish button to finalize task creation.
Example of creating a task with YAML
YAML
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            # if your libcuda.so.1 is in a custom path, set the correct path with the following annotation
            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              args: ["--backend", "gloo"]
              # comment out the resources below to run on the CPU
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for GPU core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 10
                  baidu.com/v100_32g_cgpu_memory: "2"
              # if GPU core isolation is enabled, set the following preStop hook for graceful shutdown;
              # `mnist.py` needs to be replaced with the name of your GPU process
              lifecycle:
                preStop:
                  exec:
                    command: [
                      "/bin/sh", "-c",
                      "kill -10 `ps -ef | grep mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"
                    ]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              env:
                # for GPU memory overcommitment; set to 0 to disable
                - name: CGPU_MEM_ALLOCATOR_TYPE
                  value: "1"
              args: ["--backend", "gloo"]
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for GPU core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 20
                  baidu.com/v100_32g_cgpu_memory: "4"
              # if GPU core isolation is enabled, set the following preStop hook for graceful shutdown;
              # `mnist.py` needs to be replaced with the name of your GPU process
              lifecycle:
                preStop:
                  exec:
                    command: [
                      "/bin/sh", "-c",
                      "kill -10 `ps -ef | grep mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"
                    ]
