Create MXNet Task
Last Updated: 2022-01-14
This document describes how to create an MXNet task.
Prerequisites
- The CCE AI Job Scheduler and CCE Deep Learning Frameworks Operator components have been installed successfully. Otherwise, the cloud-native AI feature is unavailable.
- If you are a sub-user, you can create tasks in a queue only if you are among the users associated with that queue.
- The MXNet deep learning framework is installed together with the CCE Deep Learning Frameworks Operator component.
Operation Steps
- Log in to the Baidu AI Cloud official website and enter the management console.
- Select “Product Service > Cloud Native > CCE” and click CCE to enter the Cloud Container Engine management console.
- Click Cluster Management > Cluster List in the navigation bar on the left side.
- On the cluster list page, click the target cluster name to enter the cluster management page.
- On the cluster management page, click Cloud Native AI > Task Management.
- On the task management page, click Create Task.
- In the basic information section, complete the task configuration.
- Task name: a custom name that may contain uppercase and lowercase letters, digits, and the special characters -, _, /, and .; it must start with a Chinese character or a letter and be 1-65 characters long.
- Queue: Select the queue associated with the new task.
- Framework: Select the deep learning framework "MXNet" corresponding to the task.
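The naming rule above can be sketched as a regular-expression check. This is only an illustration of the stated constraints (the console's actual validation logic may differ, e.g. in whether Chinese characters are accepted after the first position):

```python
import re

# Illustrative sketch of the stated rule; the console's real validation is
# authoritative. First character: a Chinese character or a letter; remaining
# characters: letters, digits, -, _, /, or . (Chinese characters are assumed
# allowed here too); total length 1-65.
NAME_RE = re.compile(r"^[A-Za-z\u4e00-\u9fff][A-Za-z0-9\u4e00-\u9fff\-_/.]{0,64}$")

def is_valid_task_name(name: str) -> bool:
    return bool(NAME_RE.fullmatch(name))

print(is_valid_task_name("mxnet-job"))  # the template's "mxnet-job" satisfies the rule
```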
- Complete the configuration by referring to the following YAML template:

```yaml
apiVersion: "kubeflow.org/v1"
kind: "MXJob"
metadata:
  name: "mxnet-job"
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            # if your libcuda.so.1 is in a custom path, set the correct path with the following annotation
            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
        spec:
          schedulerName: volcano
          containers:
            - name: mxnet
              image: registry.baidubce.com/cce-public/mxjob/mxnet:gpu
              resources:
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for gpu core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 5
                  baidu.com/v100_32g_cgpu_memory: "1"
              # if gpu core isolation is enabled, set the following preStop hook for a graceful shutdown.
              # `train_mnist.py` needs to be replaced with the name of your gpu process.
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep train_mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            # if your libcuda.so.1 is in a custom path, set the correct path with the following annotation
            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
        spec:
          schedulerName: volcano
          containers:
            - name: mxnet
              image: registry.baidubce.com/cce-public/mxjob/mxnet:gpu
              resources:
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for gpu core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 5
                  baidu.com/v100_32g_cgpu_memory: "1"
              # if gpu core isolation is enabled, set the following preStop hook for a graceful shutdown.
              # `train_mnist.py` needs to be replaced with the name of your gpu process.
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep train_mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            # if your libcuda.so.1 is in a custom path, set the correct path with the following annotation
            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
        spec:
          schedulerName: volcano
          containers:
            - name: mxnet
              image: registry.baidubce.com/cce-public/mxjob/mxnet:gpu
              command: ["python"]
              args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","10","--num-layers","2","--kv-store","dist_device_sync","--gpus","0"]
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for gpu core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 20
                  baidu.com/v100_32g_cgpu_memory: "4"
              # if gpu core isolation is enabled, set the following preStop hook for a graceful shutdown.
              # `train_mnist.py` needs to be replaced with the name of your gpu process.
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep train_mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]
```
- Click the “OK” button to complete the task creation.
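The preStop hook in the YAML template locates the training process by name with a ps/grep/awk pipeline and sends it signal 10 (SIGUSR1 on Linux) so the process can shut down gracefully. The pipeline can be sketched in isolation; here a background `sleep` stands in for the training script (`train_mnist.py` in the template):

```shell
# Stand-in for the GPU training process
sleep 300 &
expected_pid=$!

# Same idea as the preStop hook: resolve the PID by process name.
# The [s] bracket trick keeps grep from matching its own command line,
# playing the role of `grep -v grep` in the template.
found_pid=$(ps -ef | grep "[s]leep 300" | awk '{print $2}')

# The hook sends `kill -10` (SIGUSR1); here we simply terminate the stand-in
kill "$found_pid"
```

When adapting the hook, replace `train_mnist.py` with the actual name of your GPU process, as noted in the template comments.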