Load Overview
Supported data sources
Palo provides several load methods to choose from, depending on the data source.
Data sources | Load methods |
---|---|
Baidu Object Storage (BOS), HDFS, AFS | Load data with Broker Load |
Local file | Load local data |
Baidu News Service (Kafka) | Subscribe to Kafka log |
MySQL, Oracle, PostgreSQL | Synchronize data through external table |
Load data through JDBC | Synchronize data through JDBC |
Load data in JSON format | Instructions for loading data in JSON format |
MySQL binlog | Coming soon |
General description of data load
This section describes the features common to all load methods in Palo, to help users make better use of data load.
Atomicity guarantee
Every load job in Palo is a complete transaction, whether it is a batch load through Broker Load or a single import through an INSERT statement. The load transaction ensures that the data in a batch takes effect atomically, so a batch is never partially written.
Each load job also has a Label, which is unique within a Database and uniquely identifies the load job. Users can specify the Label, and for some load methods the system can generate it automatically.
A Label ensures that the corresponding load job is imported successfully at most once. If a Label that has already been imported successfully is used again, the job is rejected and the error "Label already used" is reported. This mechanism gives Palo At-Most-Once semantics. Combined with At-Least-Once semantics in the upstream system, it yields Exactly-Once semantics for the imported data.
Refer to Load transaction and atomicity for best practices on the atomicity guarantee.
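The interaction between atomic batches, Labels, and upstream retries can be illustrated with a small in-memory model. This is only a sketch of the idea and not Palo's implementation: `ToyDatabase`, `LabelAlreadyUsed`, `deliver`, and the label `job_20240101_001` are hypothetical names introduced for illustration.

```python
class LabelAlreadyUsed(Exception):
    """Raised when a label that already succeeded is submitted again."""


class ToyDatabase:
    """In-memory stand-in for one database (hypothetical, not Palo's code)."""

    def __init__(self):
        self.rows = []            # committed, visible data
        self.used_labels = set()  # labels of successfully loaded jobs

    def load(self, label, batch):
        # At-most-once: a label that already succeeded is rejected.
        if label in self.used_labels:
            raise LabelAlreadyUsed(f"Label already used: {label}")
        # Atomicity: validate the whole batch before any row becomes visible.
        staged = []
        for row in batch:
            if not isinstance(row, dict):
                raise ValueError(f"bad row: {row!r}")
            staged.append(row)
        # Commit: rows and label are recorded together.
        self.rows.extend(staged)
        self.used_labels.add(label)


def deliver(db, label, batch):
    """Upstream side: resending is safe because a used label counts as done."""
    try:
        db.load(label, batch)
        return "loaded"
    except LabelAlreadyUsed:
        return "duplicate ignored"


db = ToyDatabase()
batch = [{"id": 1}, {"id": 2}]
first = deliver(db, "job_20240101_001", batch)
second = deliver(db, "job_20240101_001", batch)  # upstream retry of the same batch
print(first, second, len(db.rows))
```

In this model the upstream system may resend a batch as often as it likes (at-least-once), while the label check lets it take effect only once (at-most-once), so the data ends up loaded exactly once. The sketch relies on the upstream system reusing the same label for the same batch.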
Synchronous and asynchronous load
Load methods are either synchronous or asynchronous. For a synchronous load, the returned result indicates whether the load succeeded or failed. For an asynchronous load, a successful return only means that the job was submitted successfully, not that the data was loaded; the user needs to check the running status of the load job with the corresponding commands.
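For an asynchronous load, the client typically polls the job status until it reaches a final state. The following sketch shows that pattern in a generic way. The function `wait_for_load`, the `fetch_state` callable, and the state names used here are assumptions for illustration; in a real client, `fetch_state` would wrap whatever status command the chosen load method provides.

```python
import time

# Assumed state names for illustration; check the states your load method reports.
TERMINAL_STATES = {"FINISHED", "CANCELLED"}


def wait_for_load(fetch_state, poll_interval=1.0, timeout=60.0, sleep=time.sleep):
    """Call fetch_state() repeatedly until a terminal state or the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"load job still in state {state!r}")
        sleep(poll_interval)


# Simulated job: submission succeeded, but the data is loaded only later.
states = iter(["PENDING", "LOADING", "LOADING", "FINISHED"])
result = wait_for_load(lambda: next(states), poll_interval=0.0)
print(result)
```

A successful submission corresponds only to the first state in the simulated sequence; the load counts as successful once the polled state is the final success state.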
Supported data formats
The supported data formats differ slightly between load methods.
Load methods | Supported formats |
---|---|
Broker Load | Parquet, ORC, CSV, gzip |
Stream Load | CSV, gzip, JSON |
Routine Load | CSV, JSON |