Data Update

Last Updated：2021-12-08

The data stored in Palo enter the system in the form of Append, which means that all the written data are unchangeable.

Thus, Palo uses mark to update the data. That is, in a batch of updated data, mark the previous data as Delete and write new data.

In reading process, Palo will automatically process these marked data (Merge-on-Read) to ensure that the user reads the latest data. At the same time, the data merging (Compaction) thread in Palo background will continue to merge the data and eliminate marked data to reduce the merging operation in the reading process and speed up the query.

Most scenarios to alter data are only applicable to Unique Key data model because only this model can guarantee the uniqueness of the primary key to support data update according to the primary key.

This document mainly introduces how to update data with Unique Key data model.

Data update

Please refer to the relevant documents for the description of UNIQUE KEY and we will not repeat here. The following is just an example.

Create a table of UNIQUE KEY model

CREATE TABLE order_table
(
    order_id BIGINT,
    order_type VARCHAR(8),
    order_status VARCHAR(32)
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8;

Import the first batch of data

1000, TYPE#1, PAID
1001, TYPE#2, PENDING
1002, TYPE#3, PAID

Alter field order_statusof the order entry with ID 1001 to PAID, and the following data should be loaded:
```
1001, TYPE#2, PAID
```
Palo will alter the field order_status of this entry to PAID in the process of reading or background merging according to primary key 1001.

Update partial fields

In the previous example, we just need to update order_status field, but the loaded data need to contain order_type field.

In some scenarios, users only know the primary key and the values of partial fields to be updated rather than the whole column data. In this case, we can obtain the whole column data through aggregation method REPLACE_IF_NOT_NULL.

UNIQUE_ KEY model, in essence, is a kind of aggregation model, and its aggregation type on non-primary key columns is REPLACEby default. For REPLACE_IF_NOT_NULL, it means no update when the value is null. Examples are as follows:

Create a table of UNIQUE KEY model with the method of REPLACE_IF_NOT_NULL.

CREATE TABLE order_table
(
    order_id BIGINT,
    order_type VARCHAR(8),
    order_status VARCHAR(32)
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8
PROPERTIES
(
    "replace_if_not_null" = "true"
);

Import the first batch of data

1000, TYPE#1, PAID
1001, TYPE#2, PENDING
1002, TYPE#3, PAID

Alter field order_statusof the order entry with ID 1001 to PAID, and the following data should be loaded:
```
1001, \N, PAID
```
\N in original data means null. Palo will alter the field order_status of this entry to PAID in the process of reading or background merging according to primary key 1001. And the field order_type will not be replaced because it is null.

Update sequence

Within Palo, only two batches of loaded data can be guaranteed, and the data of the later batch will cover and update the data of the previous batch. However, if multiple line records with the same primary key appear in the same batch of data, Palo cannot identify which one is the final effective data.

Suppose that the loaded data of a certain batch are as follows:

1000, TYPE#1, PENDING
1001, TYPE#2, PENDING
1000, TYPE#3, PAID

Note that the primary keys of the first line and the third line are the same, so it is impossible to determine which one will take effect, and the status of the order 1000 queried by the user may be PENDING or PAID.

The business side, to solve the problem, needs to ensure that there are no lines with the same primary key in the same batch of data. Or refer to Sequence Column to carry out data adaptation.

Data Load

Data Delete

百度智能云

Data Warehouse

Data Update

Data update

Update partial fields

Update sequence