Create a PyTorch Task
Last Updated: 2022-01-14
This page describes how to create a PyTorch task.
Prerequisites
- You have successfully installed the CCE AI Job Scheduler and CCE Deep Learning Frameworks Operator components. Otherwise, the cloud native AI feature is unavailable.
- If you are a sub-user, you can create a task in a queue only if you are among the users associated with that queue.
- The PyTorch deep learning framework is installed automatically during the installation of the CCE Deep Learning Frameworks Operator component.
Operation Steps
- Log in to the Baidu AI Cloud Official Website, and then enter the management console.
- Select “Product Service > Cloud Native > CCE” to enter the container engine management console.
- In the navigation bar on the left side, click Cluster Management > Cluster List.
- On the cluster list page, click the target cluster name to enter the cluster management page.
- On the cluster management page, click Cloud Native AI > Task Management.
- On the task management page, click Create Task.
- In the basic information area, complete the task configuration.
- Task name: A custom name for the task. It supports uppercase and lowercase letters, numbers, and the special characters -, _, /, and ., must start with a Chinese character or letter, and must be 1-65 characters long.
- Queue: Select the queue to associate with the new task.
- Framework: Select "PyTorch" as the deep learning framework for the task.
- Complete the configuration by referring to the following YAML template:
```yaml
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            # if your libcuda.so.1 is in a custom path, set the correct path with the following annotation
            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for gpu core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 10
                  baidu.com/v100_32g_cgpu_memory: "2"
              # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
              # `mnist.py` needs to be replaced with the name of your gpu process.
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              args: ["--backend", "gloo"]
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for gpu core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 20
                  baidu.com/v100_32g_cgpu_memory: "4"
              # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
              # `mnist.py` needs to be replaced with the name of your gpu process.
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]
```
- Click “OK” to complete the task creation.