Mark Deletion

Last Updated：2021-09-26

Mark deletion function is a supplement to the deletion function of DELETE statement. Deleting data with DELETE statement cannot support high-frequency operation scenarios.

Additionally, similar to the CDC (Change Data Capture) scenario, INSERT and DELETE usually appear alternately.

Mark deletion function is designed to support the above two scenarios.

Implementation principle

Mark deletion function only applies to tables of UNIQUE KEY model. The principle is to add a hidden column __DELETE_mark__ to the table. If the column value is true, it means that the row is carried out a deletion operation, if it is false, it means that the row is carried out an insert operation. The user only needs to add in the loaded data to hide the column.

Suppose the table structure is as follows:

Column name	Primary key
order_id	Yes
order_type	No
order_status	No

Insert the following three data lines:

1000, TYPE#1, PENDING, false
1001, TYPE#2, PENDING, false
1002, TYPE#3, PENDING, false

The values of the fourth columns are all false, which means that the three lines are all carried out insert operations. The query results after execution are as follows:

order_id	order_type	order_status
1000	TYPE#1	PENDING
1001	TYPE#2	PENDING
1002	TYPE#3	PENDING

Next, load two more lines of data:

1001, TYPE#2, PENDING, true
1002, TYPE#3, PAID, false

The first line is carried out a deletion operation, which is deleting the data corresponding to the order ID 1001（The 2 values TYPE#2, PENDING are meaningless in this data line and can be filled in at will, but they must be filled in to ensure that the number of columns matches）.

The second line is carried out an insert operation, which is equivalent to updating the status of the order 1002 to PAID.

The query results after execution are as follows:

order_id	order_type	order_status
1000	TYPE#1	PENDING
1002	TYPE#3	PAID

As seen from the above examples that the user can realize the update or deletion operation of primary key in conjunction with the UNIQUE KEY model. Set the corresponding hidden column as true to delete, while the value of hidden column is false, the operation is similar to UPSERT operation, that is, update if there is any, insert if there is none.

Enable Mark deletion function

Mark deletion is a new function introduced after Palo version 3.10.

The UNIQUE KEY tables created in the new version contain hidden columns by default. While the UNIQUE KEY tables created in previous versions do not have hidden columns. For tables of old versions, add hidden columns in the following ways:

ALTER TABLE tablename ENABLE FEATURE "BATCH_DELETE";

This operation, in essence, is a Schema Change operation，refer to SHOW ALTER TABLE COLUMN to check job execution progress after execution.

Set a variable to show hidden columns if you want to determine whether a table has enabled Mark deletion function.

SET show_hidden_columns=true`

Then use DESC tablename, if there is a column __DELETE_mark__ in output, it means that the table has enabled Mark deletion function.

Use mark deletion function in load

The methods of using Mark deletion function in different data load methods are slightly different. Mark deletion currently supports the following data load methods:

Add the following two attributes to use Mark deletion function in these loads:

Merge Type

Merge Type has 3 types, APPEND, DELETE and MERGE, and APPEND is the default.

APPEND method has no difference from normal load.

DELETE method means that each row in the batch of data is carried out a deletion operation. And there is no need to specify the column Delete Label in this method, .

MERGE method means that this batch of data is carried out the mixed operations of both insert and deletion. At this time, the user needs to identify the operation type of each row by specifying the column Delete Label.
Delete Label

Delete Label aims to, in MERGE method, specify that a certain column in the data is the column with deletion label. The user can specify the value of this column as DELETE operation.

Refer to the respective documents for specific usage syntax. Here, we only give simple examples of different load modes. Suppose that the original loaded data are as follows:

1000,TYPE#1,PENDING,0
1001,TYPE#2,PENDING,0
1002,TYPE#3,PENDING,0
1003,TYPE#2,PENDING,1
1004,TYPE#3,PAID,1

Stream Load

The request of Stream Load is as follows：

curl --location-trusted -u root \
-H "columns: order_id, order_type, order_status, delete_label" \
-H "merge_type: MERGE" \
-H "delete: delete_label=1" \
-T data.txt http://host:port/api/testDb/testTbl/_stream_load

This example shows that the fourth column is Delete Label column, and the corresponding line is carried out deletion operation when the value is 1.

Broker Load

LOAD LABEL example_db.my_label
(
    MERGE DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
    INTO TABLE example_tbl
    COLUMNS TERMINATED BY ","
    (order_id, order_type, order_status, delete_label)
    DELETE ON delete_label=1
)
WITH BROKER 'bos'
(
    ...
);

This example shows that the fourth column is Delete Label column, and the corresponding line is carried out deletion operation when the value is 1.

Routine Load

CREATE ROUTINE LOAD example_db.job1 ON example_tbl
WITH DELETE
COLUMNS(order_id, order_type, order_status, delete_label)
PROPERTIES
(
    "desired_concurrent_number"="3",
    "max_batch_interval" = "20",
    "max_batch_rows" = "300000",
    "max_batch_size" = "209715200",
    "strict_mode" = "false"
)
FROM KAFKA
(
    ...
);

Note that the Merge Type we use here is DELETE, so we don't need to specify Delete Label, the value of the fourth column will not be used, and all data are carried out deletion operation.

Points for attention

Palo is unable to guarantee the internal sequence of a batch of loaded data, so it is necessary to cooperate with Sequence Column to ensure data sequence in scenarios such as CDC.

Data Delete

Sequence-Column

百度智能云

Data Warehouse