Sequence-Column
In UNIQUE KEY mode, Palo will automatically update the data according to the primary key. When rows with the same primary key appear in the same batch of loaded data, however, Palo can't judge their sequence, which can lead to inconsistent update behavior.
In some data synchronization scenarios, it is necessary to ensure that the data can be updated in order, which is the function of Sequence Column.
Implementation principle
Sequence Column only supports tables of UNIQUE KEY model. The principle is to add a hidden column __DORIS_SEQUENCE_COL__
to the table. The user needs to specify the column type when creating the table.
In the loaded source data, the user needs to add an additional sequence column whose type __DORIS_SEQUENCE_COL__
is specified when creating the table. Palo will determine the order of the data and update according to the value of the sequence column in this order.
Enable Sequence Column function
This is a new function introduced after Palo 3.10.
-
Create a new table
Set Sequence Column as follows when creating a table:
CREATE TABLE order_table ( order_id BIGINT, order_type VARCHAR(8), order_status VARCHAR(32) ) UNIQUE KEY(order_id) DISTRIBUTED BY HASH(order_id) BUCKETS 8 PROPERTIES ( "function_column.sequence_type" = 'Date' );
Here, we specify Sequence Column type in
PROPERTIES
as Date to enable this function. Refer to CREATE TABLE command manual for more instructions. -
Enable this function for old tables
For tables of UNIQUE KEY created before version 3.10, enable this function through the following commands:
ALTER TABLE order_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")
This operation, in essence, is a Schema Change operation, view the job execution progress through SHOW ALTER TABLE COLUMN after execution.
Set a variable to show hidden columns to determine whether a table has enabled Mark deletion function.
SET show_hidden_columns=true`
Then use DESC tablename
. If there is __DORIS_SEQUENCE_COL__
column in output, it indicates that the function has been enabled in the table.
Use sequence column function in the load
The methods used in different data load modes are slightly different. This function currently supports the following data load modes:
Refer to the respective documents for specific syntax. Here, we only give simple examples of different load methods. Suppose that the original loaded data are as follows:
1000,TYPE#1,PENDING,2020-10-01
1001,TYPE#2,PAID,2020-10-02
1002,TYPE#3,PENDING,2020-10-03
1001,TYPE#2,PENDING,2020-10-01
1004,TYPE#3,PAID,2020-10-03
Stream Load
curl --location-trusted -u root \
-H "columns: order_id, order_type, order_status, source_sequence"
-H "function_column.sequence_col: source_sequence" \
-T data.txt http://host:port/api/example_db/order/_stream_load
We name the fourth column source_sequence
in columns
attribute of Header, then set the column as sequence column in attribute function_column.sequence_col
.
Thus, the final status of the order 1001 in the source data will be PAID
.
Broker Load
LOAD LABEL example_db.label1
(
DATA INFILE("hdfs://host:port/user/data/*/test.txt")
INTO TABLE `order`
COLUMNS TERMINATED BY ","
(order_id, order_type, order_status, source_sequence)
ORDER BY source_sequence
)
WITH BROKER 'bos'
(
...
);
Configure the sequence column by clause ORDER BY
.
routine load
CREATE ROUTINE LOAD example_db.job_name ON order
COLUMNS(order_id, order_type, order_status, source_sequence),
ORDER BY source_sequence
PROPERTIES
(
...
)
FROM KAFKA
(
...
);
Configure the sequence column by clause ORDER BY
.
Usage examples
Here take Stream Load as an example to show the use method and effect of sequence column through an actual example.
-
Create a table that supports Sequence Column
CREATE TABLE test_table ( user_id BIGINT, date DATE, group_id BIGINT, keyword VARCHAR(128) ) UNIQUE KEY(user_id, date, group_id) DISTRIBUTED BY HASH(user_id, date) BUCKETS 10 PROPERTIES ( "function_column.sequence_type" = 'Date' )
Then we can view the hidden columns:
mysql> set show_hidden_columns=true; Query OK, 0 rows affected (0.00 sec) mysql> desc test_table; +------------------------+--------------+------+-------+---------+---------+ | Field | Type | Null | Key | Default | Extra | +------------------------+--------------+------+-------+---------+---------+ | user_id | BIGINT | Yes | true | NULL | | | date | DATE | Yes | true | NULL | | | group_id | BIGINT | Yes | true | NULL | | | keyword | VARCHAR(128) | Yes | false | NULL | REPLACE | | __DORIS_SEQUENCE_COL__ | DATE | Yes | false | NULL | REPLACE | +------------------------+--------------+------+-------+---------+---------+ 5 rows in set (0.00 sec)
-
Load data normally
Load the following data:
1,2020-02-22,1,2020-02-22,a 1,2020-02-22,1,2020-02-22,b 1,2020-02-22,1,2020-03-05,c 1,2020-02-22,1,2020-02-26,d 1,2020-02-22,1,2020-02-22,e 1,2020-02-22,1,2020-02-22,b
Map the Sequence Column as the fourth column in the source data, column
modify_date
.curl --location-trusted -u root: \ -H "column_separator: ," \ -H "columns: user_id, date, group_id, modify_date, keyword" \ -H "function_column.sequence_col: modify_date" \ -T testData http://host:port/api/test/test_table/_stream_load
The result is:
mysql> select * from test_table; +---------+------------+----------+---------+ | user_id | date | group_id | keyword | +---------+------------+----------+---------+ | 1 | 2020-02-22 | 1 | c | +---------+------------+----------+---------+
Also, we can view the values of hidden columns:
mysql> set show_hidden_columns=true; Query OK, 0 rows affected (0.01 sec) mysql> select * from test_table; +---------+------------+----------+---------+------------------------+ | user_id | date | group_id | keyword | __DORIS_SEQUENCE_COL__ | +---------+------------+----------+---------+------------------------+ | 1 | 2020-02-22 | 1 | c | 2020-03-05 | +---------+------------+----------+---------+------------------------+
In this load, because
2020-03-05
is the maximum value in the value of Sequence Column (the value ofmodify_date
),c
is finally retained in thekeyword
column. -
Guarantee of replacement sequence
Impot the following data after the above steps are completed:
1,2020-02-22,1,2020-02-22,a 1,2020-02-22,1,2020-02-23,b
Query data
MySQL [test]> select * from test_table; +---------+------------+----------+---------+ | user_id | date | group_id | keyword | +---------+------------+----------+---------+ | 1 | 2020-02-22 | 1 | c | +---------+------------+----------+---------+
No replacement occurs because the Sequence Column of the newly-loaded data are all smaller than the existing values in the table.
Then try loading the following data:
1,2020-02-22,1,2020-02-22,a 1,2020-02-22,1,2020-03-23,w
Query data:
MySQL [test]> select * from test_table; +---------+------------+----------+---------+ | user_id | date | group_id | keyword | +---------+------------+----------+---------+ | 1 | 2020-02-22 | 1 | w | +---------+------------+----------+---------+
The data are replaced because the Sequence Column value of the newly-loaded data is greater than the value in the table.