Create PyTorch Task
Updated at: 2025-10-27
This document describes how to create a new task of the PyTorch type.
Prerequisites
- The CCE AI Job Scheduler and CCE Deep Learning Frameworks Operator components have been installed; without them, the cloud-native AI features are unavailable.
- An IAM user can create tasks in a queue only if the user has been associated with that queue.
- When you install the CCE Deep Learning Frameworks Operator component, the PyTorch deep learning framework is installed automatically.
Operation steps
- Sign in to the Baidu AI Cloud official website and enter the management console.
- Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
- Click Cluster Management - Cluster List in the left navigation pane.
- Click the target cluster name on the Cluster List page to navigate to the cluster management page.
- On the Cluster Management page, click Cloud-Native AI - Task Management.
- Click Create Task on the Task Management page.
- On the Create Task page, configure basic task information:

- Task name: Specify a custom name using lowercase letters, numbers, hyphens (-), and periods (.). The name must start and end with a lowercase letter or number and be 1-65 characters long.
- Namespace: Choose the namespace for the new task.
- Queue: Choose the corresponding queue for the new task.
- Task priority: Set the priority level for the task.
- Allow overcommitment: Enable this option to reclaim overcommitted resources through task preemption. This requires the CCE AI Job Scheduler component to be installed at version 1.4.0 or later.
- Tolerance for delay: The system will prioritize scheduling tasks or workloads to fragmented cluster resources to enhance cluster resource utilization, though this might impact business latency performance.
- Configure basic code information:

- Code Configuration Type: Choose the method for code configuration, currently supporting "BOS File," "Local File Upload," "Git Code Repository," and "Not Configured Temporarily."
- Execution command: Define the command to execute the code.
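As a sketch, the execution command configured here ends up as the container's command and args in the generated job spec; the script path and flags below are hypothetical placeholders:

YAML
              # hypothetical entrypoint and flags; replace with your own code
              command: ["python", "/workspace/train.py"]
              args: ["--epochs", "10", "--backend", "gloo"]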
- Configure data-related information:

- Set data source: Datasets, persistent volume claims, temporary paths, and host paths are currently supported. Dataset: all available datasets are listed, and selecting one automatically matches the PVC with the same name. Persistent volume claim: choose the persistent volume claim directly.
- Click "Next" to proceed to container-related configurations.
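For reference, a selected dataset or persistent volume claim surfaces in the replica's pod spec as a volume plus a mount; the volume name, mount path, and claim name below are hypothetical:

YAML
        spec:
          containers:
            - name: pytorch
              volumeMounts:
                - name: train-data        # hypothetical volume name
                  mountPath: /mnt/data    # hypothetical mount path inside the container
          volumes:
            - name: train-data
              persistentVolumeClaim:
                claimName: mnist-dataset  # PVC with the same name as the selected dataset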
- Configure task type information:

- Select Framework: Choose PyTorch.
- Training method: Select either Single-Machine or Distributed training.
- Select Role: For "Single-machine" training, only "Worker" can be chosen. For "Distributed" training, additional roles such as "PS," "Chief," and "Evaluator" can be selected.
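In the generated PyTorchJob, the training method and roles map to entries under pytorchReplicaSpecs; a minimal sketch with illustrative replica counts:

YAML
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1    # single coordinating replica
    Worker:
      replicas: 2    # increase for distributed training; single-machine training uses Worker only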
- Configure pod information (advanced settings are optional).

- Specify the desired number of pods.
- Define the restart policy for the pod. Options: “Restart on Failure” or “Never Restart”.
- Provide the address for pulling the container image. Alternatively, click Select Image to choose the desired image.
- Enter the image version. If left unspecified, the latest version will be used by default.
- Container Quota: Specify the CPU, memory, and GPU/NPU resource allocation for the container.
- Environment Variables: Enter the variable names and their corresponding values.
- Lifecycle: Includes start commands, parameters, actions after startup, and actions before stopping, all of which can be customized as needed.
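The pod settings above land in the replica's container spec; a hedged sketch, where the environment variable and lifecycle command are hypothetical examples:

YAML
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              env:
                - name: EPOCHS            # hypothetical environment variable
                  value: "10"
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
              lifecycle:
                postStart:                # hypothetical action after startup
                  exec:
                    command: ["/bin/sh", "-c", "echo started > /tmp/started"]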
- Configure the advanced task settings.

- Add credentials to access the private image registry if using a private image.
- TensorBoard: If task visualization is required, enable the TensorBoard function. After enabling it, specify the “Service Type” and “Training Log Reading Path”.
- Assign K8s labels to the task.
- Provide annotations for the task.
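These advanced settings correspond to standard Kubernetes fields on the replica's pod template; the label, annotation, and Secret name below are hypothetical:

YAML
      template:
        metadata:
          labels:
            app: mnist-train             # hypothetical K8s label
          annotations:
            owner: ml-team               # hypothetical annotation
        spec:
          imagePullSecrets:
            - name: my-registry-secret   # hypothetical Secret holding private-registry credentials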
- Click the Finish button to finalize task creation.
Example of creating a task with YAML
YAML
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            # if your libcuda.so.1 is in a custom path, set the correct path with the following annotation
            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              args: ["--backend", "gloo"]
              # comment out the resources below to run on the CPU
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for GPU core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 10
                  baidu.com/v100_32g_cgpu_memory: "2"
              # if GPU core isolation is enabled, set the following preStop hook for graceful shutdown;
              # `mnist.py` needs to be replaced with the name of your GPU process
              lifecycle:
                preStop:
                  exec:
                    command: [
                      "/bin/sh", "-c",
                      "kill -10 `ps -ef | grep mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"
                    ]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              env:
                # for GPU memory overcommitment; set to 0 to disable
                - name: CGPU_MEM_ALLOCATOR_TYPE
                  value: "1"
              args: ["--backend", "gloo"]
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for GPU core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 20
                  baidu.com/v100_32g_cgpu_memory: "4"
              # if GPU core isolation is enabled, set the following preStop hook for graceful shutdown;
              # `mnist.py` needs to be replaced with the name of your GPU process
              lifecycle:
                preStop:
                  exec:
                    command: [
                      "/bin/sh", "-c",
                      "kill -10 `ps -ef | grep mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"
                    ]
