Data Update
The data stored in Palo enter the system in the form of Append, which means that all the written data are unchangeable.
Thus, Palo uses mark to update the data. That is, in a batch of updated data, mark the previous data as Delete and write new data.
In reading process, Palo will automatically process these marked data (Merge-on-Read) to ensure that the user reads the latest data. At the same time, the data merging (Compaction) thread in Palo background will continue to merge the data and eliminate marked data to reduce the merging operation in the reading process and speed up the query.
Most scenarios to alter data are only applicable to Unique Key data model because only this model can guarantee the uniqueness of the primary key to support data update according to the primary key.
This document mainly introduces how to update data with Unique Key data model.
Data update
Please refer to the relevant documents for the description of UNIQUE KEY and we will not repeat here. The following is just an example.
-
Create a table of UNIQUE KEY model
CREATE TABLE order_table ( order_id BIGINT, order_type VARCHAR(8), order_status VARCHAR(32) ) UNIQUE KEY(order_id) DISTRIBUTED BY HASH(order_id) BUCKETS 8;
-
Import the first batch of data
1000, TYPE#1, PAID 1001, TYPE#2, PENDING 1002, TYPE#3, PAID
-
Alter field
order_status
of the order entry with ID 1001 toPAID
, and the following data should be loaded:1001, TYPE#2, PAID
Palo will alter the field
order_status
of this entry to PAID in the process of reading or background merging according to primary key 1001.
Update partial fields
In the previous example, we just need to update order_status
field, but the loaded data need to contain order_type
field.
In some scenarios, users only know the primary key and the values of partial fields to be updated rather than the whole column data. In this case, we can obtain the whole column data through aggregation method REPLACE_IF_NOT_NULL
.
UNIQUE_ KEY model, in essence, is a kind of aggregation model, and its aggregation type on non-primary key columns is REPLACE
by default. For REPLACE_IF_NOT_NULL
, it means no update when the value is null
. Examples are as follows:
-
Create a table of UNIQUE KEY model with the method of
REPLACE_IF_NOT_NULL
.CREATE TABLE order_table ( order_id BIGINT, order_type VARCHAR(8), order_status VARCHAR(32) ) UNIQUE KEY(order_id) DISTRIBUTED BY HASH(order_id) BUCKETS 8 PROPERTIES ( "replace_if_not_null" = "true" );
-
Import the first batch of data
1000, TYPE#1, PAID 1001, TYPE#2, PENDING 1002, TYPE#3, PAID
-
Alter field
order_status
of the order entry with ID 1001 toPAID
, and the following data should be loaded:1001, \N, PAID
\N
in original data means null. Palo will alter the fieldorder_status
of this entry to PAID in the process of reading or background merging according to primary key 1001. And the fieldorder_type
will not be replaced because it is null.
Update sequence
Within Palo, only two batches of loaded data can be guaranteed, and the data of the later batch will cover and update the data of the previous batch. However, if multiple line records with the same primary key appear in the same batch of data, Palo cannot identify which one is the final effective data.
Suppose that the loaded data of a certain batch are as follows:
1000, TYPE#1, PENDING
1001, TYPE#2, PENDING
1000, TYPE#3, PAID
Note that the primary keys of the first line and the third line are the same, so it is impossible to determine which one will take effect, and the status of the order 1000 queried by the user may be PENDING
or PAID
.
The business side, to solve the problem, needs to ensure that there are no lines with the same primary key in the same batch of data. Or refer to Sequence Column to carry out data adaptation.