
          Machine Learning Job

          The machine learning job includes a set of efficient, well-proven machine learning algorithms developed independently by Baidu, as well as the open-source GPU machine learning algorithms of RAPIDS-cuML. BML's efficient distributed computing capability helps users achieve their work objectives even on massive data. These algorithms apply to statistics and analysis of massive data, data mining, model training, business intelligence, and other fields. With RAPIDS-cuML, developers can run traditional ML jobs on the GPU without having to learn the details of CUDA programming.

          To use a BML algorithm, the user's data must first be run through the "data standardization" algorithm. The main objective of data standardization is to perform ID conversion of the data samples and reduce memory occupation during training. The standardized training data can then be fed to different algorithms, such as logistic regression dichotomy and XGBoost; that is, the same standardized data can serve as the input training data for multiple algorithms. With the RAPIDS-cuML algorithm, you can call the cuml and cudf libraries and submit training jobs by editing the code directly or by selecting code files. The job runs on a GPU machine; currently, BML only supports single-machine, single-card operation.

          Create a Job

          On the left navigation bar, select "Training --> Machine Learning Jobs" to enter the machine learning job list page. Click the "Create Job" button to start creating a job.

          Data Standardization

          The main objective of BML data standardization is to perform ID conversion of the data samples and reduce memory allocation during training. After the user inputs training data in the specified format, the platform counts the numbers of samples and features, converts string features into integer IDs, and outputs the ID-converted data to the user. The user can use this ID-converted data as input for various machine learning algorithms, such as LR and XGBoost. Data standardization is a separate step so that the same standardized data can be reused as input for various algorithms or for multiple rounds of parameter tuning, avoiding the tedious process of re-standardizing the data for each training run.

          Configuration Descriptions:

          Configuration Name | Required | Description
          Job name | Yes | It can only consist of numbers, letters, hyphens (-), or underscores (_), must start with a letter, and must be less than 40 characters long.
          Algorithm or framework | Yes | Data standardization
          Send an SMS at the end of the job | Yes | An SMS is sent by default.
          Input data format | Yes | Options include sparse data without a weight value, sparse data with a weight value, and dense data. For details, see the input data format descriptions below.
          Input data type | Yes | Options include classification and regression. The label of classification data is a discrete value; the label of regression data is a continuous value.
          Input data path | Yes | The path storing the training data in the required format. A single file or a directory may be entered. If a directory is entered, the platform standardizes all data under the path that meets the format requirements, including counting the numbers of samples and features and converting string features into integer IDs.
          Output path | Yes | The path storing the output data and log. After the job succeeds, the standardized data is stored in the path /{job_id}/data and the log in the path /{job_id}/log. You can use the standardized data as the input data of other BML algorithms.
          Computing resource | Yes | Currently, BML only supports BML clusters.
          Resource package | Yes | Currently, BML only supports the CPU instance with 8 cores and 32 GB memory.
          Number of instances | Yes | 2-4
          Maximum running time | Yes | If the job reaches the maximum running time, BML stops it automatically, which may cause a job failure.

          Descriptions for the input data format:

          Format Type | Format ID | Sample Format | Descriptions
          Sparse data without a weight value | SPARSE_ID | No,label,feature1,feature2,...,featureN | The weight of each feature that appears in the sample is 1, and the weight of each feature that does not appear is 0. The number of features in each sample row may differ.
          Sparse data with a weight value | SPARSE_ID_WEIGHT | No,label,feature1 weight1,feature2 weight2,...,featureN weightN | The weight of each feature in the sample is the corresponding weight value. The number of features in each sample row may differ, and there is a space between each feature and its weight.
          Dense data | DENSE | No,label,weight0,weight1,weight2,...,weightN-1 | The feature numbers of each sample are 0, 1, ..., N-1, and the corresponding weights are weight0, weight1, ..., weightN-1. The number of features in each sample row is N, which must be equal for all rows.

          The general limitations of the input data format are as follows:

          • No is the sample number of each row. Its format is not restricted, and it may be null. Note: when No is null, the sample must still start with a comma, e.g., ",label,feature_1,feature_2".
          • label is the annotation value of each sample row. It may be a discrete value or a string. It may be null for unsupervised algorithms.
          • featureN is the specific feature tag of each sample row, which may be a discrete value or a string.
          • weightN is the weight corresponding to the specific feature tag of each sample row, which may be a discrete value or a continuous value.
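
          For illustration, here are hypothetical sample rows in each of the three formats (the sample numbers, labels, feature names, and weights below are made up):

          SPARSE_ID:        1,0,color_red,shape_round
          SPARSE_ID_WEIGHT: 1,0,color_red 0.5,shape_round 1.2
          DENSE:            1,0,0.5,1.2,0.0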

          Notices:

          • No field in the user data can contain the comma or space characters used to separate fields. If these two characters appear in the user's original data, you should escape them in advance.
          • All rows that do not meet the format are ignored, which may significantly affect the quality of the model.
          • When the data is in SPARSE_ID format, the content between commas is treated as a string, and spaces are not checked. You should be aware of this and handle it in advance.

          Example Configuration:

          The training data is the SUSY data set downloaded from the Internet. A comma has been added at the beginning of each row of the data (for example, with sed -i 's/^/,/g' yourfile), and the data is split and stored on the public BOS. After downloading the data, you can divide it into training, evaluation, and prediction data and store them on your own BOS for data standardization, or directly use our public BOS data for training.
          Input data format: Dense data
          Input data type: Classification
          Input data path: bos:/bml-public/automl-demo/data/susy-train/
          The output path is your own BOS path.

          Click "OK" to submit the job.

          Descriptions for output data format:

          • dataset.info is a data set information file, which records statistical attributes of the data set, including the number of samples, the number of features, and the number of labels.
          • Under the preprocess_out directory is the ID-converted data.
          • Under the preprocess_dictionary directory are the ID-conversion dictionary and file information. In the feature dictionary file featureIDMap, the first row is the feature count feature_num, and each of the remaining feature_num rows is "feature_ID original_feature_string", separated by a space (see the parsing sketch after this list).
          • Under the preprocess_summary directory are the statistics of the features and labels. feature_summary is the feature statistics file, in the format "feature_flag feature_id feature_count feature_weight_avg feature_weight_max feature_weight_min", where feature_flag is the filter flag (0 means filtered). label_summary is the label statistics file, in the format "label_flag label_id label_count".
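
          As an illustration, here is a minimal Python sketch, assuming the featureIDMap layout described above, that loads the feature dictionary into a dict:

          # Minimal sketch: load a featureIDMap file as described above.
          # Assumes the first row holds feature_num and each following row is
          # "<feature_ID> <original feature string>", separated by a space.
          def load_feature_id_map(path):
              id_to_feature = {}
              with open(path, "r", encoding="utf-8") as f:
                  feature_num = int(f.readline().strip())  # first row: feature count
                  for _ in range(feature_num):
                      feature_id, feature_str = f.readline().rstrip("\n").split(" ", 1)
                      id_to_feature[int(feature_id)] = feature_str
              return id_to_feature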

          Logistic regression dichotomy

          BML logistic regression dichotomy trains a dichotomy (binary classification) model on standardized training data. Given standardized training data, the algorithm outputs a dichotomy model; given standardized evaluation data, it additionally evaluates the model and calculates evaluation indexes; given standardized prediction data, it additionally runs predictions with the model and saves the prediction results to the BOS.

          Configuration descriptions:

          Configuration Name | Required | Description
          Job name | Yes | It can only consist of numbers, letters, hyphens (-), or underscores (_), must start with a letter, and must be less than 40 characters long.
          Algorithm or framework | Yes | Logistic regression dichotomy
          Send an SMS at the end of the job | Yes | An SMS is sent by default.
          L1 regularization coefficient | Yes | 0 <= L1 <= 1, a floating-point number; scientific notation is supported.
          L2 regularization coefficient | Yes | 0 <= L2 <= 1, a floating-point number; scientific notation is supported.
          Convergence condition | Yes | 0 < termination <= 0.1, a floating-point number; scientific notation is supported.
          Maximum number of iterations | Yes | Training stops when it reaches the maximum number of iterations or meets the convergence condition. A positive integer with 20 <= maxIter <= 200.
          Training data path | Yes | The path storing the standardized training data (i.e., the output data path on the data standardization job details page). The algorithm uses this training data to train the model.
          Evaluation data path | No | The path storing the standardized evaluation data (i.e., the output data path on the data standardization job details page). The algorithm uses this evaluation data to evaluate the model and stores the evaluation results in the model output path /{job_id}/evaluate. If not entered, the model is output directly without an evaluation result.
          Prediction data path | No | The path storing the standardized prediction data (i.e., the output data path on the data standardization job details page). The algorithm uses the model to predict on this data and stores the prediction results in the model output path /{job_id}/predict. If not entered, no prediction is done.
          Output path | Yes | The path storing the model and log. After the job succeeds, the model is stored in the path /{job_id}/model and the log in the path /{job_id}/log.
          Computing resource | Yes | Currently, only BML clusters are supported.
          Resource package | Yes | Currently, only the CPU instance with 8 cores and 32 GB memory is supported.
          Number of instances | Yes | 2-4
          Maximum running time | Yes | If the job reaches the maximum running time, BML stops it automatically, which may cause a job failure.

          Descriptions for training data format:

          The input data of the logistic regression dichotomy algorithm must be standardized first; the standardized data is then used as the algorithm's input data.

          Example configuration:

          The training data is the SUSY data set downloaded from the Internet. A comma has been added at the beginning of each row of data (for example, with sed -i 's/^/,/g' yourfile), and the data is split and stored on the public BOS. After downloading the data, you can divide it into training, evaluation, and prediction data and store them on your own BOS, or directly use our public BOS data for training. Note that only standardized data can be used as the input data of the logistic regression dichotomy algorithm.
          Training data path: bos:/bml-public/automl-demo/data/susy-train/
          Evaluation data path: bos:/bml-public/automl-demo/data/susy-test/
          Prediction data path: bos:/bml-public/automl-demo/data/susy-all/
          First standardize the data of the above three paths by submitting three data standardization jobs. After all three jobs succeed, copy the data output path from each job details page and use it as the corresponding input data path of the logistic regression dichotomy job.

          The configuration of a new logistic regression dichotomy job is as follows:

          Click "OK" to submit the job.

          Descriptions for output data format:

          • The output model mainly consists of the weight parameters corresponding to each feature dimension in the LR model.
          • The output is in plain text format. Each row represents one feature dimension and contains three space-separated fields: the weight parameter of the feature, the internal feature ID used by the algorithm, and the original name of the feature (see the parsing sketch below).
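
          As an illustration, here is a minimal Python sketch, assuming the three-field layout described above, that reads the model file into a {feature name: weight} mapping:

          # Minimal sketch: read the LR model file described above.
          # Assumes each row is "<weight> <internal feature ID> <original feature name>".
          def load_lr_weights(path):
              name_to_weight = {}
              with open(path, "r", encoding="utf-8") as f:
                  for line in f:
                      weight, _feature_id, feature_name = line.rstrip("\n").split(" ", 2)
                      name_to_weight[feature_name] = float(weight)
              return name_to_weight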

          KMeans clustering

          BML KMeans clustering trains a clustering model on standardized training data. Given standardized training data, it trains the clustering model; given standardized prediction data, it additionally uses the trained model to output prediction results and saves them to the BOS.

          Configuration instructions:

          Configuration Name | Required | Description
          Job name | Yes | It can only consist of numbers, letters, hyphens (-), or underscores (_), must start with a letter, and must be less than 40 characters long.
          Algorithm or framework | Yes | KMeans clustering
          Send an SMS at the end of the job | Yes | Whether to send an SMS to inform the user after the job finishes.
          Number of clusters | Yes | The number of clusters, a positive integer within the range [2, 3000].
          Maximum number of iterations | Yes | Training stops when it reaches the maximum number of iterations or meets the convergence condition. A positive integer within the range [6, 10000].
          Convergence condition | Yes | Training stops if the change rate of the sum of distances from each point to its cluster center is less than the convergence condition for 5 consecutive rounds. A floating-point number within the range [0, 0.5]; scientific notation is supported.
          Initialization method of cluster center | Yes | Only INITCLUSTER_RANDOM is supported, i.e., the starting points are selected randomly.
          Distance calculation method | Yes | DISTANCE_EUCLIDEAN (Euclidean distance), DISTANCE_SQUAREEUCLIDEAN (squared Euclidean distance), DISTANCE_MANHATTAN (Manhattan distance), DISTANCE_COSINE (cosine distance), or DISTANCE_TANIMOTO (Tanimoto/Jaccard distance); see the sketch after this table.
          Center point storage mode | Yes | True indicates sparse storage and false indicates dense storage. Sparse storage reduces computing efficiency, so try to use dense storage when the feature dimension is not high. If the dimension is very high, dense storage may fail the job due to insufficient memory, so try to use sparse storage in that case.
          Output training set clustering results or not | Yes | If true, the clustering results of the training data are saved in the model output path /{job_id}/cluster; if false, they are not saved and only the model is output.
          Training data path | Yes | The path storing the standardized training data (i.e., the output data path of the data standardization job). The algorithm uses this training data to train the clustering model.
          Prediction data path | No | The path storing the standardized prediction data (i.e., the output data path on the data standardization job details page). The algorithm uses the model to predict on this data and stores the prediction results in the model output path /{job_id}/predict. If not entered, no prediction is done.
          Output path | Yes | The path storing the model and log. After the job succeeds, the model is stored in the path /{job_id}/model and the log in the path /{job_id}/log.
          Computing resource | Yes | Currently, only BML clusters are supported.
          Resource package | Yes | Currently, only the CPU instance with 8 cores and 32 GB memory is supported.
          Number of instances | Yes | 2-4
          Maximum running time | Yes | If the job reaches the maximum running time, BML stops it automatically, which may cause a job failure.
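
          For intuition, here is a minimal NumPy sketch of the five distance formulas listed in the table; this illustrates the standard formulas and is not the platform's implementation:

          import numpy as np

          # Standard formulas corresponding to the distance options above
          # (illustration only, not BML's internal implementation).
          def euclidean(x, y):          # DISTANCE_EUCLIDEAN
              return np.sqrt(np.sum((x - y) ** 2))

          def squared_euclidean(x, y):  # DISTANCE_SQUAREEUCLIDEAN
              return np.sum((x - y) ** 2)

          def manhattan(x, y):          # DISTANCE_MANHATTAN
              return np.sum(np.abs(x - y))

          def cosine(x, y):             # DISTANCE_COSINE
              return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

          def tanimoto(x, y):           # DISTANCE_TANIMOTO
              dot = np.dot(x, y)
              return 1.0 - dot / (np.dot(x, x) + np.dot(y, y) - dot)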

          Descriptions for training data format:

          The input data of the KMeans clustering algorithm must be standardized first; the standardized data is then used as the algorithm's input data.

          Example configuration:

          The training data is the Iris Flower Data Set downloaded from the Internet, divided into training data and prediction data and stored on the public BOS. After downloading the data, you can divide it into training and prediction data and store them on your own BOS, or directly use our public BOS data for training. Note that only standardized data can be used as the input data of the KMeans clustering algorithm. The standardized training and prediction data are as follows:
          Training data path: bos:/bml-public/ml-demo/data/iris-train-standardized/
          Prediction data path: bos:/bml-public/ml-demo/data/iris-predict-standardized/
          An example configuration for creating a KMeans clustering job is as follows:

          Click "OK" to submit the job.

          Descriptions for output data format:

          • In the model file, cluster_info holds the center point vectors, and distance_type holds model-related information, including the distance formula and feature dimension, which are used for batch prediction.
          • When "output training set clustering results or not" is set to true, the clustering results of the training data are saved in the model output path /{job_id}/cluster. The format of each row in the file is: original sample data; category number (see the sketch below).

          RAPIDS-cuML

          cuML is a library that implements machine learning algorithms in the RAPIDS data science ecosystem. cuML enables developers to run traditional ML jobs on the GPU without having to learn the details of CUDA programming. With the cuML library, the user can program and call various algorithms in cuML (such as KMeans and XGBoost) to implement machine learning.
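
          Here is a minimal sketch of what such a job script might look like, assuming the cuml and cudf libraries mentioned above are available in the environment (the data values are made up):

          import cudf
          from cuml.cluster import KMeans

          # Build a small GPU dataframe (illustrative values only).
          df = cudf.DataFrame({"x": [1.0, 1.2, 8.0, 8.3],
                               "y": [0.9, 1.1, 8.1, 7.9]})

          # Fit KMeans on the GPU and print the cluster assignment of each row.
          model = KMeans(n_clusters=2)
          model.fit(df)
          print(model.labels_)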

          Configuration description:

          Configuration Name | Required | Description
          Job name | Yes | It can only consist of numbers, letters, hyphens (-), or underscores (_), must start with a letter, and must be less than 40 characters long.
          Algorithm or framework | Yes | RAPIDS-cuML
          Send an SMS at the end of the job | Yes | Whether to send an SMS to inform the user after the job finishes.
          Input code | Yes | To edit the code directly, select the code template and modify or write the code in the code editor; this applies to the scenario with a single training file. To select a code file, enter the file path on the BOS and the start command; this applies to the scenario with multiple training files.
          Python version | Yes | Only Python 3 is supported in cuML.
          Output path | Yes | The path storing the model output and log. Put the trained model and data in the output directory of the container. The platform automatically uploads the content of the container's output directory to the path /{job_id}/output, and the log to the path /{job_id}/log.
          Training data path | No | The platform automatically downloads the data under this path to the train_data directory in the container environment. If the job has multiple containers, each container downloads only part of the data.
          Test data path | No | The platform automatically downloads the data under this path to the test_data directory in the container environment. If the job has multiple containers, each container downloads only part of the data.
          Computing resource | Yes | Currently, only BML clusters are supported.
          Resource package | Yes | The Deep Learning Development Card and other single-card GPU packages.
          Number of instances | Yes | 1
          Maximum running time | Yes | If the job reaches the maximum running time, BML stops it automatically, which may cause a job failure.

          Descriptions for training data format:

          There are no restrictions on the format of the training data. You can write a reader function to process the input training data, as in the sketch below.
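
          For example, here is a minimal hypothetical reader; the file name train.csv is an assumption for illustration, and train_data is the directory where the platform places the downloaded training data (see the configuration table above):

          import cudf

          # Hypothetical reader: load training data that the platform has
          # downloaded into the container's train_data directory.
          # The file name train.csv is an assumption for illustration.
          def read_train_data():
              return cudf.read_csv("train_data/train.csv")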

          Example configuration:

          The training data of the cuML-xgboost algorithm is the Mortgage Data set downloaded from the Internet; two years of downloaded data are stored on the public BOS. You can download the data and store it on your own BOS, or directly use the public BOS data for training. Training data path: bos:/bml-public/ml-demo/data/cuml-xgboost/. The output path is your own BOS path.

          The example configuration of creating a job in RAPIDS-cuML is as follows:

          Click "OK" to submit the job.

          Descriptions for output data format:

          • The output model format is the same as sklearn's. Use pickle.dump(model, open(filename, 'wb')) to save the model to a file and pickle.load(open(filename, 'rb')) to load the cuML model; or use joblib.dump(model, filename) to save the model and joblib.load(filename) to load it (see the sketch below).
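
          Here is a minimal sketch of the save and load calls described above (the file names and the untrained stand-in model are illustrative):

          import pickle
          import joblib
          from cuml.cluster import KMeans

          model = KMeans(n_clusters=2)  # stand-in for a trained cuML model

          # Save and reload with pickle, as described above.
          with open("model.pkl", "wb") as f:
              pickle.dump(model, f)
          with open("model.pkl", "rb") as f:
              model = pickle.load(f)

          # Equivalent save and reload with joblib.
          joblib.dump(model, "model.joblib")
          model = joblib.load("model.joblib")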

          Manage Jobs

          • Terminate: terminates a job that is currently running or queued. After the job is terminated, the system stops uploading the job results and job logs to the specified BOS path.
          • Clone: clones the configuration items of a job and opens the "Create Job" page.
          • Delete: deletes a job. If the job is still queued or running at the time of deletion, you need to terminate it first and then delete it. After deletion, the job disappears from the job list.
          • View job details: click the job name to enter the job details and view the job information, parameter information, and cluster information.
          • View running details: click the job name and select the "Running Details" tab to view the running status, start and end time, log details, and running curve.

          View job results

          After the job runs successfully, the model/data/log is stored at the corresponding BOS address according to the output path specified during job configuration. The user needs to go to the BOS to view or download the job model, standardized data, or log.

          Saving the job model/data/log fails in any of the following circumstances:

          • The job is terminated manually
          • The job is automatically terminated due to a running timeout
          • The job fails to run

          The user's job fails in any of the following circumstances:

          • The input data does not match the data format
          • The BOS address of the input data does not exist or is inaccessible
          • The bucket of the output log/model/data does not exist or is inaccessible
          • A training timeout occurs