Create PaddlePaddle Task
Updated at: 2025-10-27
You can create a new task specifically of the PaddlePaddle type.
Prerequisites
- You have installed the CCE AI Job Scheduler and CCE Deep Learning Frameworks Operator components; without them, the cloud-native AI features are unavailable.
- As an IAM user, you can create tasks in a queue only if you are among the users associated with that queue.
- Installing the CCE Deep Learning Frameworks Operator component will also install the PaddlePaddle deep learning framework.
Limitations
- PaddlePaddle tasks do not currently support GPU memory sharing.
Operation steps
- Sign in to the Baidu AI Cloud official website and enter the management console.
- Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
- Click Cluster Management - Cluster List in the left navigation pane.
- Click on the target cluster name in the Cluster List page to navigate to the cluster management page.
- On the Cluster Management page, click Cloud-Native AI - Task Management.
- Click Create Task on the Task Management page.
- On the Create Task page, configure basic task information:

- Task name: Specify a custom name using lowercase letters, numbers, “-”, or “.”. The name must start and end with a lowercase letter or number and be 1-65 characters long.
- Namespace: Choose the namespace for the new task.
- Queue: Choose the corresponding queue for the new task.
- Task priority: Set the priority level for the task.
- Allow overcommitment: Enable this option to allow overcommitment through task preemption. This requires the CCE AI Job Scheduler component at version 1.4.0 or later.
- Tolerance for delay: When enabled, the system preferentially schedules the task or workload to fragmented cluster resources to improve resource utilization, which may affect business latency.
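The task-name rule above can be expressed as a regular expression; a minimal sketch (the helper function name is illustrative, not part of the product):

```python
import re

# Sketch: validate a task name against the console's stated rule:
# lowercase letters, digits, "-" or ".", must start and end with a
# lowercase letter or digit, 1-65 characters total.
NAME_RE = re.compile(r"^[a-z0-9]([a-z0-9.-]{0,63}[a-z0-9])?$")

def is_valid_task_name(name: str) -> bool:
    return bool(NAME_RE.match(name))
```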
- Configure basic code information:

- Code configuration type: Specify the code configuration method. Current options include “BOS File,” “Local File Upload,” and “Not Configured Temporarily.”
- Execution command: Define the command to execute the code.
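The execution command is ultimately rendered as the container's `command` and `args`; a minimal sketch of a distributed-training command (`train.py` is a placeholder for your own entry script):

```yaml
# Sketch: how an execution command maps onto the container spec.
# "train.py" is a placeholder script name.
command:
  - python
args:
  - "-m"
  - "paddle.distributed.launch"
  - "train.py"
```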
- Configure data-related information:

- Set Data Source: Both datasets and persistent volume claims (PVCs) are supported. For datasets, all available datasets are listed, and selecting one automatically selects the PVC of the same name. For PVCs, select the desired PVC directly.
- Click "Next" to proceed to container-related configurations.
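When a task is defined in YAML instead of the console, the selected PVC appears as an ordinary Kubernetes volume in the pod template; a minimal sketch, assuming a PVC named `my-dataset` mounted at `/mnt/data` (both placeholders):

```yaml
# Sketch: mounting a dataset PVC into the training container.
# "my-dataset" and "/mnt/data" are placeholder names.
spec:
  containers:
    - name: trainer
      volumeMounts:
        - name: training-data
          mountPath: /mnt/data
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: my-dataset
```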
- Configure task type information:

- Select Framework: Choose PaddleJob.
- Training method: Select either Single-Machine or Distributed training.
- Select Role: For "Single-machine" training, only "Worker" can be chosen. For "Distributed" training, the "PS" role can also be selected.
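In YAML terms, the training method and roles correspond to the role blocks of the PaddleJob spec; a minimal sketch of a distributed job with both roles (each role block also needs a pod template, omitted here):

```yaml
# Sketch: a distributed PaddleJob with both PS and Worker roles.
# Single-machine training would define only the worker block.
spec:
  ps:
    replicas: 2
  worker:
    replicas: 2
```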
- Configure pod information (advanced settings are optional).

- Specify the desired number of pods for the role.
- Define the restart policy for the pod. Options: “Restart on Failure” or “Never Restart”.
- Provide the address for pulling the container image. Alternatively, click Select Image to choose the desired image.
- Enter the image version. If left unspecified, the latest version will be used by default.
- Set the CPU, memory, and GPU resource requirements for the container.
- Environment Variables: Enter the variable names and their corresponding values.
- Lifecycle: Includes start commands, parameters, actions after startup, and actions before stopping, all of which can be customized as needed.
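The pod settings above map onto standard fields of the pod template; a minimal sketch (the image address, environment variable, and GPU resource name are placeholders to verify against your cluster):

```yaml
# Sketch: container-level settings from the steps above.
spec:
  restartPolicy: OnFailure            # "Restart on Failure"; use Never for "Never Restart"
  containers:
    - name: trainer
      image: registry.example.com/paddle/train:latest   # placeholder image address
      env:
        - name: TRAIN_EPOCHS          # hypothetical environment variable
          value: "10"
      resources:
        requests:
          cpu: 2
          memory: 4Gi
        limits:
          nvidia.com/gpu: "1"         # GPU resource name depends on your cluster
```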
- Configure the advanced task settings.

- Set the maximum allowable training duration (leave blank for unlimited duration).
- Add credentials to access the private image registry if using a private image.
- Tensorboard: If task visualization is required, enable the Tensorboard function. After enabling it, specify the “Service Type” and “Training Log Reading Path”.
- Assign K8s labels to the task.
- Provide annotations for the task.
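Several of these advanced settings correspond to standard Kubernetes fields, some on the task's metadata and some in the pod template's spec; a minimal sketch (the label, annotation, and secret names are placeholders):

```yaml
# Sketch: advanced settings expressed with standard Kubernetes fields.
metadata:                             # labels/annotations on the task itself
  labels:
    app: paddle-train                 # placeholder K8s label
  annotations:
    owner: ml-team                    # placeholder annotation
spec:                                 # fields of the pod template's spec
  activeDeadlineSeconds: 86400        # cap training at 24 h; omit for no limit
  imagePullSecrets:
    - name: my-registry-secret        # placeholder credential for a private registry
```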
- Click the Finish button to finalize task creation.
Example of creating a task with YAML
```yaml
apiVersion: batch.paddlepaddle.org/v1
kind: PaddleJob
metadata:
  name: resnet
spec:
  cleanPodPolicy: Never
  worker:
    replicas: 2
    template:
      spec:
        schedulerName: volcano
        containers:
          - name: resnet
            image: registry.baidubce.com/cce-public/kubeflow/paddle-operator/demo-resnet:v1
            env:
              # for gpu memory over request, set 0 to disable
              - name: CGPU_MEM_ALLOCATOR_TYPE
                value: "1"
            command:
              - python
            args:
              - "-m"
              - "paddle.distributed.launch"
              - "train_fleet.py"
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
            resources:
              requests:
                cpu: 1
                memory: 2Gi
              limits:
                baidu.com/v100_16g_cgpu: "1"
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
```
