EXPORT

Description

This statement is used to export the data of the specified table to the specified location.

This is an asynchronous operation: the statement returns as soon as the task is submitted successfully. The progress can be viewed with the SHOW EXPORT command.
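For example, after submitting a job, its state and progress can be checked with:

SHOW EXPORT;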

EXPORT TABLE table_name
[PARTITION (p1[,p2])]
TO export_path
[opt_properties]
WITH BROKER
[broker_properties];
  • table_name

The name of the table to be exported. Only the data of Palo local tables can be exported.

  • partition

If specified, only the listed partitions of the table are exported.

  • export_path

The export path must be a directory.

  • opt_properties

Specifies optional export parameters, in the following form:

    [PROPERTIES ("key"="value", ...)]

    The following parameters can be specified:

    • column_separator: the column separator for the exported files. The default is \t. Only single-byte separators are supported.
    • line_delimiter: the line delimiter for the exported files. The default is \n. Only single-byte delimiters are supported.
    • exec_mem_limit: the maximum memory usage on a single BE node for the export job, in bytes. The default is 2GB.
    • timeout: the timeout of the export job, in seconds. The default is 2 hours (7200 seconds).
    • tablet_num_per_task: the maximum number of tablets that a single subtask can scan.
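    For example, a PROPERTIES clause that raises the timeout to 4 hours (14400 seconds) and limits each subtask to 5 tablets would look as follows; the values are illustrative only:

    PROPERTIES
    (
        "timeout" = "14400",
        "tablet_num_per_task" = "5"
    )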
  • WITH BROKER

The export function writes data to the remote storage through the Broker process, so the connection information required by the Broker must be defined here.

    WITH BROKER bos ("key"="value"[,...])

Example

  1. Export all data in the testTbl table to BOS.

    EXPORT TABLE testTbl
    TO "bos://my_buckets/export/"
    WITH BROKER bos
    (
        "bos_endpoint" = "http://bj.bcebos.com",
        "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
        "bos_secret_accesskey"="yyyyyyyyyyyyyyyyyyyyyyyyyy"
    );
  2. Export partitions p1 and p2 of the testTbl table to BOS, specifying a column separator.

    EXPORT TABLE testTbl
    PARTITION (p1,p2)
    TO "bos://my_buckets/export/"
    PROPERTIES
    (
        "column_separator" = ",",
        "tablet_num_per_task" = "10"
    )
    WITH BROKER bos
    (
        "bos_endpoint" = "http://bj.bcebos.com",
        "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
        "bos_secret_accesskey"="yyyyyyyyyyyyyyyyyyyyyyyyyy"
    );

Keywords

EXPORT

Best Practices

Splitting into subtasks

An Export job is split into several subtasks (execution plans) that execute in sequence. How many query plans are needed depends on the total number of tablets and on the maximum number of tablets a single query plan may scan.

Because the query plans are executed serially, the total execution time of the job can be reduced by letting each query plan handle more tablets.

However, if a query plan fails (for example, an RPC call to the Broker fails, or the remote storage jitters), too many tablets per plan increase the cost of retrying that plan.

Therefore, the number of query plans and the number of tablets each plan scans must be arranged reasonably, balancing execution time against success rate.

Generally, it is recommended that the amount of data scanned by a single query plan be within 3-5 GB.
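As a hypothetical sizing example, for a table with 100 tablets of roughly 1GB each, setting tablet_num_per_task to 4 keeps each query plan at about 4GB (within the recommended range) and splits the job into 25 subtasks:

EXPORT TABLE testTbl
TO "bos://my_buckets/export/"
PROPERTIES ("tablet_num_per_task" = "4")
WITH BROKER bos
(
    "bos_endpoint" = "http://bj.bcebos.com",
    "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
    "bos_secret_accesskey"="yyyyyyyyyyyyyyyyyyyyyyyyyy"
);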

Memory limit

Usually, the query plan of an Export job involves only two parts, scan and export, and contains no computation logic that requires much memory. Therefore, the default memory limit of 2GB is usually sufficient.

However, in some scenarios, for example when a single query plan has to scan too many tablets on one BE node, or when tablets have too many data versions, memory may be insufficient. In that case, a larger memory limit, such as 4GB or 8GB, can be set through the exec_mem_limit parameter.
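For example, to raise the limit to 8GB for a single job, note that the value must be given in bytes (8GB = 8589934592 bytes):

EXPORT TABLE testTbl
TO "bos://my_buckets/export/"
PROPERTIES ("exec_mem_limit" = "8589934592")
WITH BROKER bos
(
    "bos_endpoint" = "http://bj.bcebos.com",
    "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
    "bos_secret_accesskey"="yyyyyyyyyyyyyyyyyyyyyyyyyy"
);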

Notes

  • Exporting large amounts of data at once is not recommended. The recommended maximum amount of data for a single Export job is tens of GB; larger exports produce more junk files and raise the retry cost. If the table is too large, it is recommended to export it partition by partition (see the sketch after this list).
  • If the Export job fails, the __doris_export_tmp_xxx temporary directory generated in the remote storage, and the files already generated in it, are not deleted and must be deleted manually by the user.
  • If the Export job succeeds, the __doris_export_tmp_xxx directory generated in the remote storage may be kept or removed depending on the file system semantics of the remote storage. For example, in Baidu Object Storage (BOS), after the last file in a directory is moved out by a rename operation, the directory itself is also deleted. If the directory is not cleared, the user can clear it manually.
  • The Export job exports only the data of the Base table, not the data of materialized views.
  • The Export operation scans data and occupies IO resources, which may increase the query latency of the system.
  • At most 5 Export jobs can run simultaneously in one cluster; jobs submitted beyond this limit are queued.
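A minimal sketch of the partition-by-partition export recommended above, assuming a hypothetical monthly partition p202401; one such job is submitted per partition, each writing to its own subdirectory:

EXPORT TABLE testTbl
PARTITION (p202401)
TO "bos://my_buckets/export/p202401/"
WITH BROKER bos
(
    "bos_endpoint" = "http://bj.bcebos.com",
    "bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
    "bos_secret_accesskey"="yyyyyyyyyyyyyyyyyyyyyyyyyy"
);

The remaining partitions are exported the same way, one job at a time.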