Sequence-Column

Last Updated：2021-11-02

In UNIQUE KEY mode, Palo will automatically update the data according to the primary key. When rows with the same primary key appear in the same batch of loaded data, however, Palo can't judge their sequence, which can lead to inconsistent update behavior.

In some data synchronization scenarios, it is necessary to ensure that the data can be updated in order, which is the function of Sequence Column.

Implementation principle

Sequence Column only supports tables of UNIQUE KEY model. The principle is to add a hidden column __DORIS_SEQUENCE_COL__ to the table. The user needs to specify the column type when creating the table.

In the loaded source data, the user needs to add an additional sequence column whose type __DORIS_SEQUENCE_COL__ is specified when creating the table. Palo will determine the order of the data and update according to the value of the sequence column in this order.

Enable Sequence Column function

This is a new function introduced after Palo 3.10.

Create a new table

Set Sequence Column as follows when creating a table:

CREATE TABLE order_table
(
    order_id BIGINT,
    order_type VARCHAR(8),
    order_status VARCHAR(32)
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8
PROPERTIES
(
    "function_column.sequence_type" = 'Date'
);

Here, we specify Sequence Column type inPROPERTIES as Date to enable this function. Refer to CREATE TABLE command manual for more instructions.

Enable this function for old tables

For tables of UNIQUE KEY created before version 3.10, enable this function through the following commands:
```
ALTER TABLE order_table ENABLE FEATURE "SEQUENCE_LOAD"
WITH PROPERTIES ("function_column.sequence_type" = "Date")
```
This operation, in essence, is a Schema Change operation, view the job execution progress through SHOW ALTER TABLE COLUMN after execution.

Set a variable to show hidden columns to determine whether a table has enabled Mark deletion function.

SET show_hidden_columns=true`

Then use DESC tablename. If there is __DORIS_SEQUENCE_COL__column in output, it indicates that the function has been enabled in the table.

Use sequence column function in the load

The methods used in different data load modes are slightly different. This function currently supports the following data load modes:

Refer to the respective documents for specific syntax. Here, we only give simple examples of different load methods. Suppose that the original loaded data are as follows:

1000,TYPE#1,PENDING,2020-10-01
1001,TYPE#2,PAID,2020-10-02
1002,TYPE#3,PENDING,2020-10-03
1001,TYPE#2,PENDING,2020-10-01
1004,TYPE#3,PAID,2020-10-03

Stream Load

curl --location-trusted -u root \
-H "columns: order_id, order_type, order_status, source_sequence"
-H "function_column.sequence_col: source_sequence" \
-T data.txt http://host:port/api/example_db/order/_stream_load

We name the fourth column source_sequencein columns attribute of Header, then set the column as sequence column in attribute function_column.sequence_col.

Thus, the final status of the order 1001 in the source data will be PAID.

Broker Load

LOAD LABEL example_db.label1
(
    DATA INFILE("hdfs://host:port/user/data/*/test.txt")
    INTO TABLE `order`
    COLUMNS TERMINATED BY ","
    (order_id, order_type, order_status, source_sequence)
    ORDER BY source_sequence
)
WITH BROKER 'bos'
(
    ...
);

Configure the sequence column by clause ORDER BY.

routine load

CREATE ROUTINE LOAD example_db.job_name ON order 
COLUMNS(order_id, order_type, order_status, source_sequence),
ORDER BY source_sequence
PROPERTIES
(
    ...
)
FROM KAFKA
(
    ...
);

Configure the sequence column by clause ORDER BY.

Usage examples

Here take Stream Load as an example to show the use method and effect of sequence column through an actual example.

Create a table that supports Sequence Column

CREATE TABLE test_table
(
    user_id     BIGINT,
    date        DATE,
    group_id    BIGINT,
    keyword     VARCHAR(128)
)
UNIQUE KEY(user_id, date, group_id)
DISTRIBUTED BY HASH(user_id, date) BUCKETS 10
PROPERTIES
(
    "function_column.sequence_type" = 'Date'
)

Then we can view the hidden columns:

mysql> set show_hidden_columns=true;
Query OK, 0 rows affected (0.00 sec)
mysql> desc test_table;
+------------------------+--------------+------+-------+---------+---------+
| Field                  | Type         | Null | Key   | Default | Extra   |
+------------------------+--------------+------+-------+---------+---------+
| user_id                | BIGINT       | Yes  | true  | NULL    |         |
| date                   | DATE         | Yes  | true  | NULL    |         |
| group_id               | BIGINT       | Yes  | true  | NULL    |         |
| keyword                | VARCHAR(128) | Yes  | false | NULL    | REPLACE |
| __DORIS_SEQUENCE_COL__ | DATE         | Yes  | false | NULL    | REPLACE |
+------------------------+--------------+------+-------+---------+---------+
5 rows in set (0.00 sec)

Load data normally

Load the following data:

1,2020-02-22,1,2020-02-22,a
1,2020-02-22,1,2020-02-22,b
1,2020-02-22,1,2020-03-05,c
1,2020-02-22,1,2020-02-26,d
1,2020-02-22,1,2020-02-22,e
1,2020-02-22,1,2020-02-22,b

Map the Sequence Column as the fourth column in the source data, column modify_date.

curl --location-trusted -u root: \
-H "column_separator: ," \
-H "columns: user_id, date, group_id, modify_date, keyword" \
-H "function_column.sequence_col: modify_date" \
-T testData http://host:port/api/test/test_table/_stream_load

The result is:

mysql> select * from test_table;
+---------+------------+----------+---------+
| user_id | date       | group_id | keyword |
+---------+------------+----------+---------+
|       1 | 2020-02-22 |        1 | c       |
+---------+------------+----------+---------+

Also, we can view the values of hidden columns:

mysql> set show_hidden_columns=true;
Query OK, 0 rows affected (0.01 sec)

mysql> select * from test_table;
+---------+------------+----------+---------+------------------------+
| user_id | date       | group_id | keyword | __DORIS_SEQUENCE_COL__ |
+---------+------------+----------+---------+------------------------+
|       1 | 2020-02-22 |        1 | c       | 2020-03-05             |
+---------+------------+----------+---------+------------------------+

In this load, because 2020-03-05 is the maximum value in the value of Sequence Column (the value of modify_date), c is finally retained in the keyword column.

Guarantee of replacement sequence

Impot the following data after the above steps are completed:

1,2020-02-22,1,2020-02-22,a
1,2020-02-22,1,2020-02-23,b

Query data

MySQL [test]> select * from test_table;
+---------+------------+----------+---------+
| user_id | date       | group_id | keyword |
+---------+------------+----------+---------+
|       1 | 2020-02-22 |        1 | c       |
+---------+------------+----------+---------+

No replacement occurs because the Sequence Column of the newly-loaded data are all smaller than the existing values in the table.

Then try loading the following data:

1,2020-02-22,1,2020-02-22,a
1,2020-02-22,1,2020-03-23,w

Query data:

MySQL [test]> select * from test_table;
+---------+------------+----------+---------+
| user_id | date       | group_id | keyword |
+---------+------------+----------+---------+
|       1 | 2020-02-22 |        1 | w       |
+---------+------------+----------+---------+

The data are replaced because the Sequence Column value of the newly-loaded data is greater than the value in the table.

Mark Deletion

Data Export

百度智能云

Data Warehouse