Load Data in BOS
This document mainly teaches how to load data stored in BOS.
Preparations
Please first store the data to be loaded into Palo on Baidu Object Storage (BOS) through the following steps.
-
Open BOS service
Refer to Start to use BOS
-
Create Bucket
Refer to Create Bucket
Note: the domain of the Bucket must be the same as that of the Palo cluster. The Palo domain can usually be viewed in the upper left corner of the Palo console page
-
Upload files to Bucket
There are two ways to upload files to Bucket.
Upload files directly through the console, pleaser refer to document Upload Object.
Upload files via command line tool:
- First [Download BOS CLI command line tool](https://cloud.baidu.com/doc/BOS/s/Ejwvyqobd#bos-cliDownload Address#). Here take Linux operation system bce-cli-0.10.10.zip as an example.
-
Execute the following command to configure BOS CLI after decompression:
./bce -c BOS Access Key ID []: 353b8dexxxxxxxxxxb156d3 BOS Secret Access Key []: ea15a18xxxxxx29f78e8d77 BCE Security Token [None]: Default region name [bj]: Default domain [bj.bcebos.com]: Default use auto switch domain [yes]: Default breakpoint_file_expiration [7] days: Default use https protocol [no]: Default multi upload thread num [10]:
- BOS Access Key ID and BOS Secret Access Key can be obtained by clicking
Account profile -> Security authentication
on the top right corner of the public cloud page. - Fill in the abbreviations of domain of Bucket in Default region name and Default domain, please refer to [Domain Access](https://cloud.baidu.com/doc/BOS/s/Ck1rk80hn#Access Domain Name (endpoint)#) .
- Other configurations can be configured by default.
- BOS Access Key ID and BOS Secret Access Key can be obtained by clicking
-
Upload the file with the following command:
./bce bos cp /path/to/local/your_file.txt bos:/your_bucket_name
Start load
Palo supports the following two ways to load BOS data.
Submit load job through Broker Load command
Broker, a stateless process service, has been built into Palo cluster and is mainly used to read and write files from external data sources. Broker Load is used to access source data and load data with Broker service.
-
Create a table
Create a table to store the data to be loaded through
CREATE TABLE
command. Refer to command manual CREATE TABLE for specific load methods. Examples are as follows:CREATE TABLE IF NOT EXISTS load_test ( id INT, name VARCHAR(128) ) DISTRIBUTED BY HASH(id) BUCKETS 8;
-
Submit Broker Load load job
Refer to command manual Broker Load for detailed Broker Load command syntax. Examples are as follows:
LOAD LABEL example_db.exmpale_label_1 ( DATA INFILE("bos://your_bucket_name/your_file.txt") INTO TABLE load_test COLUMNS TERMINATED BY "," ) WITH BROKER "bos" ( "bos_endpoint" = "http://bj.bcebos.com", "bos_accesskey" = "353b8dexxxxxxxxxxb156d3", "bos_secret_accesskey" = "ea15a18xxxxxx29f78e8d77" ) PROPERTIES ( "timeout" = "3600" );
- LABEL: Each load job needs a specified and unique Label, which can be used to view the running status of the load job.
- WITH BROKER "bos": "bos" is just the Broker service process name and does not represent the data source to be accessed. Broker name can be viewed through
SHOW BROKER
command after connecting to Palo with the user of admin. - "bos_accesskey" and "bos_secret_accesskey" can be obtained by clicking
Account profile -> Security authentication
on the top right corner of the public cloud page. - "bos_endpoint" is related to the domain of BOS Bucket, refer to [Domain Access](https://cloud.baidu.com/doc/BOS/s/Ck1rk80hn#Access Domain Name (endpoint)#) to obtain.
-
View load job status
Broker Load is an asynchronous command, successful execution of the command in the second step only means successful submission of the job. Check the following commands to see specific execution.
mysql> SHOW LOAD FROM example_db WHERE LABEL="exmpale_label_1" *************************** 1. row *************************** JobId: 10041 Label: exmpale_label_1 State: FINISHED Progress: ETL:100%; LOAD:100% Type: BROKER EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=100000000 TaskInfo: cluster:N/A; timeout(s):14400; max_filter_ratio:0.0 ErrorMsg: NULL CreateTime: 2020-11-17 09:38:04 EtlStartTime: 2020-11-17 09:38:09 EtlFinishTime: 2020-11-17 09:38:09 LoadStartTime: 2020-11-17 09:38:09 LoadFinishTime: 2020-11-17 09:42:07 URL: N/A JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"All backends":{},"FileNumber":0,"FileSize":0} 1 row in set (0.01 sec)
If the status of the
State
field isFINISHED
, the load is successful and the data can be queried. Refer toSHOW LOAD
command file for a specific description of theSHOW LOAD
return result. -
Cancel load job
The following commands can cancel a running Broker Load load job:
CANCEL LOAD WHERE LABEL="exmpale_label_1";
All loaded data will be rolled back after successful cancellation. Palo will automatically ensure that the data atom in an load job takes effect.
Note: Refer to BROKER LOAD command file for more detailed and advanced functions of the Broker Load command.
load through external table
Also, Palo supports creating a Broker external table to refer to the data stored in BOS, and then loading the data through INSERT INTO SELECT
.
-
Create a table
Create a table for data storage. Same as above and there is no need to repeat.
-
Create Broker external table
Refer to CREATE EXTERNAL TABLE command manual for specific instructions for the command of creating external table. Examples are as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS example_db.example_ext_table ( id INT, name VARCHAR(128) ) ENGINE=BROKER PROPERTIES ( "broker_name" = "bos", "path" = "bos://your_bucket_name/your_file.txt", ) BROKER PROPERTIES ( "bos_endpoint" = "http://bj.bcebos.com", "bos_accesskey" = "353b8dexxxxxxxxxxb156d3", "bos_secret_accesskey" = "ea15a18xxxxxx29f78e8d77" );
- ENGINE: The type of ENGINE is BROKER, which means that it is an external table that accesses data with Broker service.
- "broker_name" is "bos","bos" is just the Broker service process name and does not represent the data source to be accessed. Broker name can be viewed through
SHOW BROKER
command after connecting to Palo with the user of admin. - "bos_accesskey" and "bos_secret_accesskey" can be obtained by clicking
Account profile -> Security authentication
on the top right corner of the public cloud page. - "bos_endpoint" is related to the domain of BOS Bucket, refer to [Domain Access](https://cloud.baidu.com/doc/BOS/s/Ck1rk80hn#Access Domain Name(endpoint)#) to obtain.
Note: external table data can also be queried directly through
SELECT
with low efficiency, so it is recommended to execute the query after loading them into Palo.
-
load data
load data from an external table to an internal table with the following command.
sql INSERT INTO load_test SELECT * FROM example_ext_table;
The command is a synchronous command (the operation of submitting INSERT job asynchronously is under development), the successful command return result means that the data load is completed. When the loaded data amount is large, the task may be canceled due to query timeout. Refer to variable
query_timeout
in Varialbe file for the setting of query timeout.
Note: Refer to INSERT command file for more detailed and advanced functions of INSERT command.