Deep Learning Job

Last Updated: 2020-10-16

Deep learning jobs integrate several open-source deep learning frameworks. You can choose a framework, write code for multiple rounds of training and iteration, and upload the generated models and data to BOS storage.

Create a Job

Training jobs offer multiple resource packages and GPU resources of different models under unified resource scheduling, which improves training speed.

Select "Training > Deep learning job" in the left navigation bar to open the deep learning job list page, then click "Create Job" to start creating a job.

When creating a job, you need to submit the code to run and complete the corresponding configuration.

You can provide the code in two ways:

Edit code directly: paste the debugged code into the code edit box to start a job.
Select code file: upload the code to BOS, then enter the path of the code file on BOS to start a cluster job.

If you choose to edit code directly, enter the code in the code edit box.

You can also click "Select code template" to start from one of the provided code templates. Note, however, that the selected template overwrites any code already in the code editing area.
If you choose "Select code file", select the BOS path where the code is stored to complete code input.
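
Whichever input method is used, the submitted code is an ordinary training script. As a rough sketch of its shape, the example below reads whatever files are present in a train_data directory and writes a result file into an output directory. The directory names come from the configuration table in this document. Resolving them relative to the working directory, the file name model.json, and the toy "training" step (recording file sizes) are assumptions made only for illustration.

```python
import json
from pathlib import Path

# Directory names taken from the configuration table; treating them as
# relative to the working directory is an assumption.
TRAIN_DIR = Path("train_data")
OUTPUT_DIR = Path("output")


def train(train_dir: Path, output_dir: Path) -> Path:
    """Toy stand-in for a training loop: records the size of each input file."""
    # With several containers, each one may receive only part of the data,
    # so the script works on whatever files are present locally.
    if train_dir.is_dir():
        files = sorted(p for p in train_dir.rglob("*") if p.is_file())
    else:
        files = []
    summary = {str(p.relative_to(train_dir)): p.stat().st_size for p in files}

    # Files written to the output directory are the ones the platform is
    # described as uploading after the job finishes.
    output_dir.mkdir(parents=True, exist_ok=True)
    model_path = output_dir / "model.json"  # hypothetical file name
    model_path.write_text(json.dumps(summary, indent=2))
    return model_path
```

A real script would replace the body of `train` with framework code and end with a call such as `train(TRAIN_DIR, OUTPUT_DIR)`.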

Job Configuration Item

Configuration name	Required	Description
Job name	Yes	May contain only letters, digits, and hyphens (-), and must start with a letter
Algorithm or framework	Yes	Supports TensorFlow v1.13.1, Python V1.1.0, and PaddlePaddle v1.4.0
Whether to send SMS at the end of job	Yes	Whether to send an SMS notification when the job ends
Output path	Yes	The path where the model output and logs are stored. Put the trained model and data into the output directory of the container; the platform automatically uploads the contents of that directory to the path /{job_id}/output and uploads the logs to the path /{job_id}/log
Training data path	No	The platform automatically downloads the data under this path to the local train_data directory in the container environment. If the job has multiple containers, each container is assigned to download only part of the data
Testing data path	No	The platform automatically downloads the data under this path to the local test_data directory in the container environment. If the job has multiple containers, each container is assigned to download only part of the data
Computing resources	Yes	BML cluster (or your private CCE cluster)
Resource package	Yes	Options include CPU instance (2 cores, 4 GB memory), CPU instance (8 cores, 32 GB memory), GPU instance (deep learning development card, 6 cores, 40 GB memory, 1 card), GPU instance (K40, 6 cores, 40 GB memory, 1 card), GPU instance (V100, 6 cores, 40 GB memory, 1 card), etc.
Number of instances	Yes	Number of instances the job runs on (multi-machine configuration)
Maximum running time	Yes	A job that runs longer than the maximum running time is terminated automatically, which may result in no results being generated.
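
The rules in the table can be checked before a job is submitted. The sketch below validates a configuration held in a plain dictionary: it reports missing required items and applies the job name rule. The field names are hypothetical stand-ins for the rows marked "Yes", and reading "letters" as ASCII letters is an assumption. The platform's own validation may differ.

```python
import re

# Letters, digits, or "-", starting with a letter (rule from the "Job name" row).
# Restricting "letters" to ASCII is an assumption.
JOB_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# Hypothetical field names for the rows marked "Yes" in the Required column.
REQUIRED_FIELDS = (
    "job_name",
    "framework",
    "sms_on_finish",
    "output_path",
    "computing_resources",
    "resource_package",
    "instance_count",
    "max_running_time",
)


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in config
    ]
    job_name = config.get("job_name")
    if job_name is not None and not JOB_NAME_RE.fullmatch(str(job_name)):
        errors.append(f"invalid job name: {job_name!r}")
    return errors
```

For example, a name such as `mnist-train-01` passes the name check, while `1st-job` is rejected because it starts with a digit.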

Job Management Related Operations

For jobs that have been submitted, you can do the following:

Terminate: stop a job that is currently running or queued. After a job is terminated, its results and logs are not uploaded to the specified BOS path.
Clone: copy the code and configuration items of a job into the job creation page.
Delete: delete the job. If the job is still queued or running at the time of deletion, it is terminated before being deleted.
View job details: click the job name to open the job details, where you can view the job configuration information, job code, and job operation details.
Job operation details: view the current running status of the job and its start and end times.
Resource information list: view the running status and running log of each container used by the job. For a running job, you can view the running log directly. For a job that has finished running, a BOS redirect address and a download link for the stored running log are provided so that you can view or download it.
View log analysis: when a job fails, you can view the log analysis of the failed job here.

View Job Results

After the job finishes, the training results and running logs are stored at the corresponding BOS addresses, according to the output path and log storage path specified during job configuration.
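
To make the storage layout concrete, the helper below builds the two locations from an output path and a job ID, following the /{job_id}/output and /{job_id}/log pattern given in the configuration table. It assumes those paths sit directly under the configured output path. The bucket name and job ID in the usage note are made up.

```python
def result_locations(output_path: str, job_id: str) -> dict:
    """Build the result and log locations for a job.

    Assumes /{job_id}/output and /{job_id}/log are appended directly
    to the output path chosen during job configuration.
    """
    base = output_path.rstrip("/")
    return {
        "output": f"{base}/{job_id}/output",
        "log": f"{base}/{job_id}/log",
    }
```

With a hypothetical output path `bos:/my-bucket/jobs/` and job ID `job-123`, the helper returns `bos:/my-bucket/jobs/job-123/output` for results and `bos:/my-bucket/jobs/job-123/log` for logs.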

Go to BOS to view or download the job results. To view or download the running log, use the BOS redirect address and download link provided for the stored log. Job results and job logs are not saved in two cases:

The job is terminated manually.
The job is terminated automatically because it exceeded the maximum running time.
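
Because a timed-out job loses its results, one possible precaution is for the training script to track its own time budget and stop early. The sketch below is only an illustration: it assumes that a script which exits by itself before the limit counts as a normal completion, and the function name, the `safety_margin` parameter, and the injectable `clock` are all hypothetical. It does not account for the time the platform needs to upload results.

```python
import time


def run_with_budget(steps, max_seconds, safety_margin=0.1, clock=time.monotonic):
    """Run callables in order, stopping before the time budget is used up.

    max_seconds   : the job's configured maximum running time, in seconds
    safety_margin : fraction of the budget held back for saving results
    Returns the number of steps that were completed.
    """
    deadline = clock() + max_seconds * (1.0 - safety_margin)
    completed = 0
    for step in steps:
        # Check the clock before each step so a new step never starts
        # after the deadline has passed.
        if clock() >= deadline:
            break
        step()
        completed += 1
    return completed
```

Each step could be one training epoch followed by saving a checkpoint to the output directory, so that the last completed epoch is still available when the loop stops early.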
