Baidu AI Cloud

          Cloud Container Engine

          Create MXNet Task

          You can create an MXNet task.

          Prerequisites

          • You have successfully installed the CCE AI Job Scheduler and CCE Deep Learning Frameworks Operator components. Otherwise, the cloud-native AI feature is unavailable.
          • If you are a sub-user, you can create tasks in a queue only if you are among the users associated with that queue.
          • The MXNet deep learning framework is installed along with the CCE Deep Learning Frameworks Operator component.

          Operation Steps

          1. Log in to Baidu AI Cloud Official Website, and then enter the management console.
          2. Select “Product Service > Cloud Native > CCE” to enter the Cloud Container Engine management console.
          3. Click Cluster Management > Cluster List in the navbar on the left side.
          4. On the cluster list page, click the target cluster name to enter the cluster management page.
          5. On the cluster management page, click Cloud Native AI > Task Management.
          6. On the task management page, click Create Task.
          7. In the basic information section, complete the task configuration.


          • Task name: Customize the task name. It supports upper- and lower-case letters, digits, and special characters such as -, _, /, and ., must start with a Chinese character or a letter, and must be 1-65 characters long.
          • Queue: Select the queue to associate with the new task.
          • Framework: Select the deep learning framework corresponding to the task, here "MXNet".
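The naming rule above can be sketched as a regular expression. This is a minimal illustration only; `is_valid_task_name` is a hypothetical helper, not part of the CCE console or API.

```python
import re

# Naming rule from the console: first character is a Chinese character
# or a letter; remaining characters may be letters, digits, '-', '_',
# '/', or '.'; total length 1-65. (The CJK range below is an assumption
# for "Chinese character".)
_TASK_NAME_RE = re.compile(r"^[A-Za-z\u4e00-\u9fa5][A-Za-z0-9\u4e00-\u9fa5_\-./]{0,64}$")

def is_valid_task_name(name: str) -> bool:
    """Return True if the name satisfies the naming rule sketched above."""
    return bool(_TASK_NAME_RE.match(name))

print(is_valid_task_name("mxnet-job"))   # valid
print(is_valid_task_name("1-bad-name"))  # invalid: starts with a digit
```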
          8. Complete the configuration by referring to the following YAML template:
          apiVersion: "kubeflow.org/v1"
          kind: "MXJob"
          metadata:
            name: "mxnet-job"
          spec:
            jobMode: MXTrain
            mxReplicaSpecs:
              Scheduler:
                replicas: 1
                restartPolicy: Never
                template:
                  metadata:
                    annotations:
                      sidecar.istio.io/inject: "false"
                      # if your libcuda.so.1 is in custom path, set the correct path with the following annotation
                      # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
                  spec:
                    schedulerName: volcano
                    containers:
                      - name: mxnet
                        image: registry.baidubce.com/cce-public/mxjob/mxnet:gpu
                        resources:
                          limits:
                            baidu.com/v100_32g_cgpu: "1"
                            # for gpu core/memory isolation
                            baidu.com/v100_32g_cgpu_core: 5
                            baidu.com/v100_32g_cgpu_memory: "1"
                        # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
                        # `train_mnist.py` needs to be replaced with the name of your gpu process.
                        lifecycle:
                          preStop:
                            exec:
                              command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep train_mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]
              Server:
                replicas: 1
                restartPolicy: Never
                template:
                  metadata:
                    annotations:
                      sidecar.istio.io/inject: "false"
                      # if your libcuda.so.1 is in custom path, set the correct path with the following annotation
                      # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
                  spec:
                    schedulerName: volcano
                    containers:
                      - name: mxnet
                        image: registry.baidubce.com/cce-public/mxjob/mxnet:gpu
                        resources:
                          limits:
                            baidu.com/v100_32g_cgpu: "1"
                            # for gpu core/memory isolation
                            baidu.com/v100_32g_cgpu_core: 5
                            baidu.com/v100_32g_cgpu_memory: "1"
                        # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
                        # `train_mnist.py` needs to be replaced with the name of your gpu process.
                        lifecycle:
                          preStop:
                            exec:
                              command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep train_mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]
              Worker:
                replicas: 1
                restartPolicy: Never
                template:
                  metadata:
                    annotations:
                      sidecar.istio.io/inject: "false"
                      # if your libcuda.so.1 is in custom path, set the correct path with the following annotation
                      # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
                  spec:
                    schedulerName: volcano
                    containers:
                    - name: mxnet
                      image: registry.baidubce.com/cce-public/mxjob/mxnet:gpu
                      command: ["python"]
                      args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","10","--num-layers","2","--kv-store","dist_device_sync","--gpus","0"]
                      resources:
                        requests:
                          cpu: 1
                          memory: 1Gi
                        limits:
                          baidu.com/v100_32g_cgpu: "1"
                          # for gpu core/memory isolation
                          baidu.com/v100_32g_cgpu_core: 20
                          baidu.com/v100_32g_cgpu_memory: "4"
                      # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
                      # `train_mnist.py` needs to be replaced with the name of your gpu process.
                      lifecycle:
                        preStop:
                          exec:
                            command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep train_mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]
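The preStop hook in the template locates the training process with `ps -ef | grep ... | awk '{print $2}'` and sends it signal 10 (SIGUSR1) so the cGPU-isolated process can shut down gracefully. The PID-extraction step of that pipeline can be sketched in Python; `find_pids` is a hypothetical helper shown on canned `ps -ef` output, not a live container.

```python
import signal  # signal.SIGUSR1 == 10, the signal sent by the preStop hook

def find_pids(ps_output: str, pattern: str) -> list:
    """Mimic `ps -ef | grep <pattern> | grep -v grep | awk '{print $2}'`:
    return the PIDs (second column) of lines matching the pattern,
    excluding the grep process itself."""
    pids = []
    for line in ps_output.splitlines():
        if pattern in line and "grep" not in line:
            fields = line.split()
            if len(fields) > 1 and fields[1].isdigit():
                pids.append(int(fields[1]))
    return pids

# Canned `ps -ef` output for illustration.
sample = """\
root     1     0  0 10:00 ?  00:00:00 /bin/sh entrypoint.sh
root    42     1  0 10:00 ?  00:05:12 python /incubator-mxnet/example/image-classification/train_mnist.py
root    99     1  0 10:01 ?  00:00:00 grep train_mnist.py
"""
print(find_pids(sample, "train_mnist.py"))  # [42]
# Inside the container, the hook would then run: os.kill(pid, signal.SIGUSR1)
```

As in the template, replace `train_mnist.py` with the name of your own GPU process.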
          9. Click the “OK” button to complete the task creation.