Baidu AI Cloud

          Baidu Machine Learning

          Deep Learning Job

          A deep learning job integrates several open-source deep learning frameworks. You can choose a framework, write code for multiple rounds of training and iteration, and upload the generated models and data to BOS storage.

          Create a Job

          Training jobs offer multiple resource packages and several GPU models under unified resource scheduling, which improves training speed.

          Select "Training > Deep learning job" in the left navigation bar to open the deep learning job list page. Click "Create Job" to start creating a new job.

          When creating a job, you need to submit the code to run and complete the corresponding configuration.

          There are two ways to submit the code:

          1. Edit code directly: paste the debugged code into the code edit box and start the job. You can also click "Select code template" to load one of the provided reference templates. Note that the selected template overwrites whatever is already in the code editing area.
          2. Select code file: upload the code to BOS, then fill in the BOS path of the code file and start the cluster job.
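
          As a rough sketch of the kind of script that could be pasted into the code edit box or uploaded to BOS, the example below reads files from a local train_data directory and writes a result file into an output directory, the directory names described under the job configuration. The relative paths, the function name run_job, the file name summary.json, and the placeholder "training" step (it only counts bytes) are assumptions for illustration and not a platform template; real code would use one of the supported frameworks.

```python
import json
from pathlib import Path


def run_job(train_dir="train_data", output_dir="output"):
    """Placeholder job: summarize the input files and save the result."""
    train_path = Path(train_dir)
    out_path = Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Collect whatever files were downloaded for this container.
    if train_path.is_dir():
        files = sorted(p for p in train_path.rglob("*") if p.is_file())
    else:
        files = []

    # Stand-in for real training: record file names and sizes.
    summary = {
        "num_files": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "files": [p.relative_to(train_path).as_posix() for p in files],
    }

    # Results are written under the output directory so they can be uploaded.
    result_file = out_path / "summary.json"
    result_file.write_text(json.dumps(summary, indent=2))
    return summary
```

          Calling run_job() with no arguments at the end of the script would use the default directory names.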

          Job Configuration Items

          Configuration name | Required | Description
          Job name | Yes | May contain only digits, letters, and hyphens (-), and must start with a letter.
          Algorithm or framework | Yes | Supports TensorFlow v1.13.1, Python v1.1.0, and PaddlePaddle v1.4.0.
          Send SMS at the end of job | Yes | Whether to send an SMS notification when the job ends.
          Output path | Yes | The path where the model output and logs are stored. Write the trained model and data to the output directory of the container; the platform automatically uploads the contents of that directory to the path /{job_id}/output and uploads the logs to the path /{job_id}/log.
          Training data path | No | The platform automatically downloads the data under this path to the local train_data directory in the container environment. If the job has multiple containers, each container downloads only part of the data.
          Testing data path | No | The platform automatically downloads the data under this path to the local test_data directory in the container environment. If the job has multiple containers, each container downloads only part of the data.
          Computing resources | Yes | BML cluster (or your private CCE cluster).
          Resource package | Yes | Options include: CPU instance, 2 cores, 4 GB memory; CPU instance, 8 cores, 32 GB memory; GPU instance, deep learning development card, 6 cores, 40 GB memory, 1 card; GPU instance, K40, 6 cores, 40 GB memory, 1 card; GPU instance, V100, 6 cores, 40 GB memory, 1 card; and others.
          Number of instances | Yes | The number of instances to run the job on (multi-machine configuration).
          Maximum running time | Yes | A job that runs beyond the maximum running time is terminated automatically, which may result in no results being generated.
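
          The job name rule in the table can be checked locally before a job is submitted. The following is a minimal sketch, assuming ASCII letters and no length limit (the table states neither); the function name is_valid_job_name is hypothetical and not part of any BML SDK.

```python
import re

# First character a letter, then letters, digits, or hyphens.
# Assumes ASCII letters and no length limit; the table states neither.
JOB_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def is_valid_job_name(name: str) -> bool:
    """Return True if name satisfies the job name rule from the table."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_job_name("mnist-train-01"))  # True
print(is_valid_job_name("1st-run"))         # False: starts with a digit
print(is_valid_job_name("my_job"))          # False: underscore not allowed
```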

          For jobs that have been submitted, you can do the following:

          • Terminate: terminate a job that is currently running or queued. After termination, the job results and job logs are not uploaded to the specified BOS path.
          • Clone: copy the code and configuration items of a job into the job creation page.
          • Delete: delete the job. If the job is still queued or running at the time of deletion, it is terminated before being deleted.
          • View job details: click the job name to open the job details and view the job configuration, job code, and job operation details.
          • Job operation details: view the current operation status of the job and its start and end times.
          • Resource information list: view the running status and running log of each container used by the job. For a running job, you can view the running log directly. For a job that has finished, a BOS redirect address and a download link for the stored running log are provided so that you can view or download it.
          • View log analysis: when a job fails with an error, you can view the log analysis of the failed job here.

          View Job Results

          After the job finishes, the training results and running logs are stored at the BOS addresses given by the output result storage path and the log storage path specified during job configuration.
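
          To locate the results afterwards, the expected BOS locations can be derived from the configured output path and the job ID. The sketch below assumes that /{job_id}/output and /{job_id}/log are appended to the configured output path; the function name and the example bucket path are hypothetical, and the actual addresses shown on the job details page take precedence.

```python
import posixpath


def expected_result_paths(output_path: str, job_id: str) -> dict:
    """Build the locations where results and logs are expected.

    Assumes /{job_id}/output and /{job_id}/log are appended to the
    configured output path; verify against the job details page.
    """
    base = output_path.rstrip("/")
    return {
        "output": posixpath.join(base, job_id, "output"),
        "log": posixpath.join(base, job_id, "log"),
    }


# Hypothetical bucket path and job ID, for illustration only.
paths = expected_result_paths("my-bucket/jobs/", "job-123")
print(paths["output"])  # my-bucket/jobs/job-123/output
print(paths["log"])     # my-bucket/jobs/job-123/log
```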

          Go to BOS to view or download the running results of the job. To view or download the running log, use the BOS redirect address and download link provided for the stored log. The job results and job logs cannot be saved in two cases:

          1. The job is terminated manually.
          2. The job is terminated automatically because it exceeded the maximum running time.
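
          Because a job that reaches the maximum running time is terminated without its results being saved, one defensive pattern is for the training loop to track its own elapsed time and stop early. The sketch below assumes the script is given the same limit that was set in the job configuration; the names train_with_budget, train_one_epoch, and save_checkpoint, and the default 10% safety margin, are illustrative choices and not platform features.

```python
import time


def train_with_budget(train_one_epoch, save_checkpoint, max_epochs,
                      max_seconds, margin=0.1, clock=time.monotonic):
    """Run up to max_epochs, stopping early before max_seconds elapses.

    Returns the number of epochs completed.
    """
    start = clock()
    # Leave a safety margin so the script can exit normally.
    deadline = start + max_seconds * (1.0 - margin)
    completed = 0
    longest_epoch = 0.0

    for epoch in range(max_epochs):
        epoch_start = clock()
        # Stop if another epoch as long as the slowest one so far
        # would run past the deadline.
        if epoch_start + longest_epoch > deadline:
            break
        train_one_epoch(epoch)
        save_checkpoint(epoch)  # keep the latest results on disk
        completed += 1
        longest_epoch = max(longest_epoch, clock() - epoch_start)

    return completed
```

          The clock parameter exists only so the logic can be exercised with a fake timer; in a real job the default time.monotonic is sufficient.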