
          Load Data in BOS

          This document describes how to load data stored in BOS into Palo.

          Preparations

          Please first upload the data to be loaded into Palo to Baidu Object Storage (BOS) through the following steps.

          1. Open BOS service

            Refer to Start to use BOS

          2. Create Bucket

            Refer to Create Bucket

            Note: the Bucket must be in the same region as the Palo cluster. The Palo cluster's region can usually be viewed in the upper left corner of the Palo console page.

          3. Upload files to Bucket

            There are two ways to upload files to Bucket.

            Upload files directly through the console; please refer to the document Upload Object.

            Upload files via command line tool:

            1. First [download the BOS CLI command line tool](https://cloud.baidu.com/doc/BOS/s/Ejwvyqobd). The following takes bce-cli-0.10.10.zip for the Linux operating system as an example.
            2. Execute the following command to configure BOS CLI after decompression:

              ./bce -c
              BOS Access Key ID []: 353b8dexxxxxxxxxxb156d3
              BOS Secret Access Key []: ea15a18xxxxxx29f78e8d77
              BCE Security Token [None]:
              Default region name [bj]:
              Default domain [bj.bcebos.com]:
              Default use auto switch domain [yes]:
              Default breakpoint_file_expiration [7] days:
              Default use https protocol [no]:
              Default multi upload thread num [10]: 
              • BOS Access Key ID and BOS Secret Access Key can be obtained by clicking Account profile -> Security authentication in the top right corner of the public cloud page.
              • Fill in Default region name and Default domain according to the region of your Bucket; refer to [Domain Access](https://cloud.baidu.com/doc/BOS/s/Ck1rk80hn).
              • The remaining options can be left at their default values.
            3. Upload the file with the following command:

               ./bce bos cp /path/to/local/your_file.txt bos:/your_bucket_name
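            After the upload, Palo will address the file by a bos:// URI. As a quick sanity check, a small Python sketch (the helper name is hypothetical; the bucket and file names follow the examples above) shows how the bucket name and object key map to that URI:

```python
def bos_uri(bucket: str, object_key: str) -> str:
    # Palo's DATA INFILE clause addresses a file as bos://<bucket>/<object-key>.
    return "bos://{}/{}".format(bucket, object_key.lstrip("/"))

print(bos_uri("your_bucket_name", "your_file.txt"))  # bos://your_bucket_name/your_file.txt
```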

          Start load

          Palo supports the following two ways to load BOS data.

          Submit load job through Broker Load command

          Broker is a stateless process service built into the Palo cluster; it is mainly used to read and write files on external data sources. Broker Load accesses and loads source data through the Broker service.

          1. Create a table

            Create a table to store the data to be loaded with the CREATE TABLE command. Refer to the CREATE TABLE command manual for the detailed syntax. An example is as follows:

            CREATE TABLE IF NOT EXISTS load_test
            (
                id INT,
                name VARCHAR(128)
            )
            DISTRIBUTED BY HASH(id) BUCKETS 8;
          2. Submit a Broker Load job

            Refer to the Broker Load command manual for the detailed command syntax. An example is as follows:

            LOAD LABEL example_db.example_label_1
            (
                DATA INFILE("bos://your_bucket_name/your_file.txt")
                INTO TABLE load_test
                COLUMNS TERMINATED BY ","
            )
            WITH BROKER "bos"
            (
                "bos_endpoint" = "http://bj.bcebos.com",
                "bos_accesskey" = "353b8dexxxxxxxxxxb156d3",
                "bos_secret_accesskey" = "ea15a18xxxxxx29f78e8d77"
            )
            PROPERTIES
            (
                "timeout" = "3600"
            );
            • LABEL: Each load job requires a unique Label, which can later be used to check the job's running status.
            • WITH BROKER "bos": "bos" is just the name of the Broker service process and does not represent the data source being accessed. The Broker name can be viewed with the SHOW BROKER command after connecting to Palo as the admin user.
            • "bos_accesskey" and "bos_secret_accesskey" can be obtained by clicking Account profile -> Security authentication in the top right corner of the public cloud page.
            • "bos_endpoint" depends on the region of the BOS Bucket; refer to [Domain Access](https://cloud.baidu.com/doc/BOS/s/Ck1rk80hn) to obtain it.
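            Because a Label can be used only once per database, scripts that submit jobs repeatedly often derive the Label from a timestamp. A minimal Python sketch (the prefix and timestamp format are just one possible convention, not part of Palo):

```python
import time

def make_load_label(prefix="bos_load", when=None):
    # Labels must be unique within a database; suffixing a timestamp is
    # one simple scheme for generating a fresh label per submission.
    t = time.localtime(when) if when is not None else time.localtime()
    return "{}_{}".format(prefix, time.strftime("%Y%m%d_%H%M%S", t))
```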
          3. View load job status

            Broker Load is an asynchronous command; successful execution in the previous step only means that the job was submitted successfully. Use the following command to check the job's actual progress:

            mysql> SHOW LOAD FROM example_db WHERE LABEL="example_label_1"\G
            *************************** 1. row ***************************
                     JobId: 10041
                     Label: example_label_1
                     State: FINISHED
                  Progress: ETL:100%; LOAD:100%
                      Type: BROKER
                   EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=100000000
                  TaskInfo: cluster:N/A; timeout(s):14400; max_filter_ratio:0.0
                  ErrorMsg: NULL
                CreateTime: 2020-11-17 09:38:04
              EtlStartTime: 2020-11-17 09:38:09
             EtlFinishTime: 2020-11-17 09:38:09
             LoadStartTime: 2020-11-17 09:38:09
            LoadFinishTime: 2020-11-17 09:42:07
                       URL: N/A
                JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"All backends":{},"FileNumber":0,"FileSize":0}
            1 row in set (0.01 sec)

            If the State field is FINISHED, the load has succeeded and the data can be queried. Refer to the SHOW LOAD command manual for a detailed description of the returned fields.
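            Since the job finishes asynchronously, automated pipelines typically poll SHOW LOAD until a terminal state is reached. A sketch of such a loop, assuming the caller supplies a fetch_state callable that returns the State column (e.g. via any MySQL client; the function name and defaults are hypothetical):

```python
import time

def wait_for_load(fetch_state, timeout_s=3600, poll_s=5):
    # Poll until the job reaches a terminal state: FINISHED means the
    # data is visible, CANCELLED means the job failed or was cancelled.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        state = fetch_state()
        if state == "FINISHED":
            return True
        if state == "CANCELLED":
            return False
        time.sleep(poll_s)
    raise TimeoutError("load job still running after {}s".format(timeout_s))
```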

          4. Cancel load job

            The following command cancels an in-progress Broker Load job:

            CANCEL LOAD WHERE LABEL="example_label_1";

            All loaded data will be rolled back after a successful cancellation. Palo guarantees that the data within a single load job takes effect atomically.

          Note: Refer to the BROKER LOAD command manual for more detailed and advanced features of the Broker Load command.

          Load through external table

          Palo also supports creating a Broker external table that references the data stored in BOS, and then loading the data with INSERT INTO SELECT.

          1. Create a table

            Create a table to store the data. This is the same as in the previous section and is not repeated here.

          2. Create Broker external table

            Refer to the CREATE EXTERNAL TABLE command manual for instructions on creating an external table. An example is as follows:

            CREATE EXTERNAL TABLE IF NOT EXISTS example_db.example_ext_table
            (
                id INT,
                name VARCHAR(128)
            )
            ENGINE=BROKER
            PROPERTIES
            (
                "broker_name" = "bos",
                "path" = "bos://your_bucket_name/your_file.txt"
            )
            BROKER PROPERTIES
            (
                "bos_endpoint" = "http://bj.bcebos.com",
                "bos_accesskey" = "353b8dexxxxxxxxxxb156d3",
                "bos_secret_accesskey" = "ea15a18xxxxxx29f78e8d77"
            );
            • ENGINE: The type of ENGINE is BROKER, which means that it is an external table that accesses data with Broker service.
            • "broker_name" is "bos". "bos" is just the name of the Broker service process and does not represent the data source being accessed. The Broker name can be viewed with the SHOW BROKER command after connecting to Palo as the admin user.
            • "bos_accesskey" and "bos_secret_accesskey" can be obtained by clicking Account profile -> Security authentication in the top right corner of the public cloud page.
            • "bos_endpoint" depends on the region of the BOS Bucket; refer to [Domain Access](https://cloud.baidu.com/doc/BOS/s/Ck1rk80hn) to obtain it.

          Note: external table data can also be queried directly with SELECT, but this is inefficient; it is recommended to load the data into Palo first and then query it.

          3. Load data

            Load data from the external table into the internal table with the following command:

            INSERT INTO load_test SELECT * FROM example_ext_table;

            This is a synchronous command (submitting INSERT jobs asynchronously is under development); a successful return means the data load has completed. When the amount of data to load is large, the task may be cancelled due to query timeout. Refer to the variable query_timeout in the Variable manual for how to adjust the query timeout.
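            Raising the session timeout and running the synchronous load can be wrapped in one helper. A hedged Python sketch over a generic DB-API cursor (the function, cursor wiring, and 7200-second default are assumptions; the table names follow the examples above):

```python
def load_from_external(cursor, dst="load_test", src="example_ext_table",
                       timeout_s=7200):
    # Raise the session query timeout first so a large synchronous
    # INSERT ... SELECT is not cancelled mid-way, then run the load.
    cursor.execute("SET query_timeout = {};".format(int(timeout_s)))
    cursor.execute("INSERT INTO {} SELECT * FROM {};".format(dst, src))
```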

          Note: Refer to the INSERT command manual for more detailed and advanced features of the INSERT command.
