# Column mapping, converting and filtering
Palo supports rich column mapping, converting, and filtering operations, which allow flexible preprocessing of the raw data being loaded. This document describes how to use these features when loading data.
## General introduction
Palo processes data in a load job in the following steps:

- The data in the source file are read in, in the original column order.
- The raw data are filtered for the first time by the preceding filter condition (PRECEDING FILTER).
- Column mapping and converting map the raw data to the target column order.
- The converted data are filtered a second time by the post filter condition (WHERE).
- The final data are written.
Column mapping, converting, and filtering parameters are all optional in a load job. If they are not specified, Palo splits each row of the source file by the default column separator `\t` and maps the resulting fields to the table columns in order. If the number of columns in the source file does not match the number of columns in the table, data quality errors occur and the data cannot be loaded. In that case, the column mapping, converting, and filtering information must be described explicitly.
## Supported load methods

- BROKER LOAD

```sql
LOAD LABEL example_db.label1
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    (k1, k2, tmpk3)
    PRECEDING FILTER k1 = 1
    SET (
        k3 = tmpk3 + 1
    )
    WHERE k1 > k2
)
WITH BROKER bos
(
    ...
);
```

- STREAM LOAD

```bash
curl --location-trusted -u user:passwd \
    -H "columns: k1, k2, tmpk3, k3 = tmpk3 + 1" \
    -H "where: k1 > k2" \
    -T file.txt \
    http://host:port/api/testDb/testTbl/_stream_load
```

- ROUTINE LOAD

```sql
CREATE ROUTINE LOAD example_db.label1 ON my_table
COLUMNS(k1, k2, tmpk3, k3 = tmpk3 + 1),
PRECEDING FILTER k1 = 1,
WHERE k1 > k2
...
```
All of the above load methods support column mapping, converting, and filtering of source data:

- Preceding filter: filter the raw data that have just been read, before any other processing.

  `PRECEDING FILTER k1 = 1`

- Mapping: define the columns of the source data. If a defined column name is the same as a column in the table, it is mapped directly to that table column. If they differ, the defined column can be used in subsequent converting. As in the example above:

  `(k1, k2, tmpk3)`

- Converting: convert the mapped columns from the previous step using built-in expressions, functions, or user-defined functions, and remap the results to the corresponding columns in the table. As in the example above:

  `k3 = tmpk3 + 1`

- Post filter: filter the mapped and converted columns with expressions. Filtered rows are not loaded into the system. As in the example above:

  `WHERE k1 > k2`
## Column mapping
Column mapping describes each column in the file being loaded; it is equivalent to defining the column names of the source data. By describing the mapping relationship, source files whose column order or column count differs from the table can still be loaded into Palo. Here are some examples.
Suppose the source file has 4 columns, as shown below (the header names are only for convenience; the file actually has no header):

| Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|
| 1 | 100 | beijing | 1.1 |
| 2 | 200 | shanghai | 1.2 |
| 3 | 300 | guangzhou | 1.3 |
| 4 | \N | chongqing | 1.4 |
Note: `\N` means `null` in the source file.
- Adjust mapping order

  Suppose the table has 4 columns `k1, k2, k3, k4`. The expected load mapping relationship is as follows:

  ```text
  Column 1 -> k1
  Column 2 -> k3
  Column 3 -> k2
  Column 4 -> k4
  ```

  Then the column mapping should be written as:

  `(k1, k3, k2, k4)`
- There are more columns in the source file than in the table

  Suppose the table has 3 columns `k1, k2, k3`. The expected load mapping relationship is as follows:

  ```text
  Column 1 -> k1
  Column 2 -> k3
  Column 3 -> k2
  ```

  Then the column mapping should be written as:

  `(k1, k3, k2, tmpk4)`

  Here `tmpk4` is a self-defined column name that does not exist in the table, so Palo ignores that column. (A Broker Load sketch using a mapping of this kind appears after this list.)
- There are fewer columns in the source file than in the table; fill the rest with default values

  Suppose the table has 5 columns `k1, k2, k3, k4, k5`. The expected load mapping relationship is as follows:

  ```text
  Column 1 -> k1
  Column 2 -> k3
  Column 3 -> k2
  ```

  Here only the first 3 columns of the source file are used, while columns `k4` and `k5` are expected to be filled with default values. The column mapping should be written as:

  `(k1, k3, k2)`

  If columns `k4` and `k5` have default values, those defaults are filled in. Otherwise, if a column is `nullable`, `null` is filled in. Otherwise, the load job reports an error.
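As a concrete illustration of the extra-column case above, here is a minimal Broker Load sketch. It reuses the bos path, broker, and `my_table` from the earlier example; the label name is hypothetical:

```sql
LOAD LABEL example_db.label_mapping_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    -- Name all 4 source columns; tmpk4 does not exist in the table
    -- and is therefore ignored.
    (k1, k3, k2, tmpk4)
)
WITH BROKER bos ( ... );
```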
## Column preceding filter

Preceding filter performs the first filtering pass on the raw data that have just been read. Currently, only BROKER LOAD and ROUTINE LOAD support it.
Preceding filter is applied in the following scenarios:

- Filtering before converting

  When rows should be dropped before column mapping and converting, a preceding filter can remove the unneeded data first.

- The filter column does not exist in the table and serves only as a filter flag

  For example, the source data contain the data of multiple tables (or the data of multiple tables are written to the same Kafka message queue). Each row carries a column value identifying which table the row belongs to. Users can use a preceding filter condition to select only the rows of the target table for loading. (A Broker Load sketch of this scenario follows this list.)
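For the second scenario, a minimal Broker Load sketch might look as follows. The column name `table_id`, the filter value `1`, and the label are hypothetical; the bos path and broker follow the earlier example:

```sql
LOAD LABEL example_db.label_preceding_filter_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    -- table_id only identifies which table a source row belongs to;
    -- it is not a column of my_table, so Palo ignores it.
    (table_id, k1, k2, k3)
    -- Keep only the rows destined for this table.
    PRECEDING FILTER table_id = 1
)
WITH BROKER bos ( ... );
```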
## Column converting

Column converting allows users to transform column values from the source file. Palo currently supports most built-in functions as well as user-defined functions for converting.

Note: a user-defined function belongs to a particular database, so the user needs read permission on that database to use it for converting.

Converting is usually defined together with column mapping: the columns are mapped first and then converted. Here are some examples.
Suppose the source file has 4 columns, as shown below (the header names are only for convenience; the file actually has no header):

| Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|
| 1 | 100 | beijing | 1.1 |
| 2 | 200 | shanghai | 1.2 |
| 3 | 300 | guangzhou | 1.3 |
| \N | 400 | chongqing | 1.4 |
- Load column values from the source file into the table after converting

  Suppose the table has 4 columns `k1, k2, k3, k4`. The expected mapping and converting relationships are as follows:

  ```text
  Column 1       -> k1
  Column 2 * 100 -> k3
  Column 3       -> k2
  Column 4       -> k4
  ```

  Then the column mapping should be written as:

  `(k1, tmpk3, k2, k4, k3 = tmpk3 * 100)`

  This is equivalent to naming the second column of the source file `tmpk3` and specifying that the value of table column `k3` is `tmpk3 * 100`. The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 1 | beijing | 10000 | 1.1 |
  | 2 | shanghai | 20000 | 1.2 |
  | 3 | guangzhou | 30000 | 1.3 |
  | null | chongqing | 40000 | 1.4 |
- Conditional converting with the case when function

  Suppose the table has 4 columns `k1, k2, k3, k4`. We want `beijing, shanghai, guangzhou, chongqing` in the source data to be converted to the corresponding region ids before loading:

  ```text
  Column 1 -> k1
  Column 2 -> k2
  Column 3, converted to a region id -> k3
  Column 4 -> k4
  ```

  Then the column mapping should be written as:

  `(k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)`

  The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 1 | 100 | 1 | 1.1 |
  | 2 | 200 | 2 | 1.2 |
  | 3 | 300 | 3 | 1.3 |
  | null | 400 | 4 | 1.4 |
- Convert null values in the source file to 0, combined with the region id converting of example 2

  Suppose the table has 4 columns `k1, k2, k3, k4`. While converting the region id, we also want null values in column `k1` of the source data to become 0:

  ```text
  Column 1, converted to 0 if null -> k1
  Column 2 -> k2
  Column 3 -> k3
  Column 4 -> k4
  ```

  Then the column mapping should be written as:

  `(tmpk1, k2, tmpk3, k4, k1 = ifnull(tmpk1, 0), k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)`

  The final data in the table are as follows (a Broker Load sketch of this converting appears after this list):

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 1 | 100 | 1 | 1.1 |
  | 2 | 200 | 2 | 1.2 |
  | 3 | 300 | 3 | 1.3 |
  | 0 | 400 | 4 | 1.4 |
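In Broker Load syntax, conversions are written in a SET clause (as in the first example of this document). Example 3 might then be sketched as follows; the label name is hypothetical:

```sql
LOAD LABEL example_db.label_convert_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    -- Map the raw columns; tmpk1 and tmpk3 are temporary names.
    (tmpk1, k2, tmpk3, k4)
    SET (
        -- Convert null in the first column to 0.
        k1 = ifnull(tmpk1, 0),
        -- Convert the province name to a region id.
        k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2
             when "guangzhou" then 3 when "chongqing" then 4 else null end
    )
)
WITH BROKER bos ( ... );
```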
## Column filter

After column mapping and converting, rows that we do not want to load into Palo can be filtered out with filter conditions. Here are some examples.
Suppose the source file has 4 columns, as shown below (the header names are only for convenience; the file actually has no header):

| Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|
| 1 | 100 | beijing | 1.1 |
| 2 | 200 | shanghai | 1.2 |
| 3 | 300 | guangzhou | 1.3 |
| \N | 400 | chongqing | 1.4 |
- Filter directly, with column mapping and converting left at their defaults

  Suppose the table has 4 columns `k1, k2, k3, k4`. Filter conditions can be defined directly while keeping the default column mapping and converting. To load only the rows whose fourth column is greater than 1.2, the filter condition is:

  `where k4 > 1.2`

  The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 3 | 300 | guangzhou | 1.3 |
  | null | 400 | chongqing | 1.4 |

  By default, Palo maps columns in order, so the fourth column of the source file is automatically mapped to table column `k4`.
- Filter the data after column converting

  Suppose the table has 4 columns `k1, k2, k3, k4`. In the column converting example we converted the province names to ids. Here we want to filter out the rows whose id is 3. The converting and filter conditions are:

  `(k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)`

  `where k3 != 3`

  The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 1 | 100 | 1 | 1.1 |
  | 2 | 200 | 2 | 1.2 |
  | null | 400 | 4 | 1.4 |

  Note that the column values the filter sees are the final values after mapping and converting, not the original source values. (A Broker Load sketch of this example appears after this list.)
- Multi-condition filtering

  Suppose the table has 4 columns `k1, k2, k3, k4`. To filter out the rows whose column `k1` is `null` as well as the rows whose column `k4` is less than 1.2, note that the `where` condition describes the rows to keep, so the filter condition is:

  `where k1 is not null and k4 >= 1.2`

  The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 2 | 200 | 2 | 1.2 |
  | 3 | 300 | 3 | 1.3 |
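Written as a Broker Load sketch, example 2 above might look like this (hypothetical label; the conversion goes in SET and the post filter in WHERE, following the first example of this document):

```sql
LOAD LABEL example_db.label_filter_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    (k1, k2, tmpk3, k4)
    SET (
        k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2
             when "guangzhou" then 3 when "chongqing" then 4 else null end
    )
    -- The filter sees the converted value of k3, not the raw city name.
    WHERE k3 != 3
)
WITH BROKER bos ( ... );
```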
## Data quality issues and filter thresholds

The rows processed in a load job fall into three types:

- Filtered Rows

  Rows filtered out because of unqualified data quality. Quality problems include type errors, precision errors, over-long strings, a mismatched number of file columns, and rows filtered out because no corresponding partition exists for them.

- Unselected Rows

  Rows filtered out by the `preceding filter` or `where` conditions.

- Loaded Rows

  Rows that are loaded correctly.
Palo load jobs allow users to set a maximum error ratio (`max_filter_ratio`). If the error ratio of the loaded data is below this threshold, the error rows are ignored and the remaining correct data are loaded.

The error ratio is calculated as:

`#Filtered Rows / (#Filtered Rows + #Loaded Rows)`

In other words, `Unselected Rows` do not count toward the error ratio. For example, if 10 rows are filtered out for quality problems and 90 rows are loaded correctly, the error ratio is 10 / (10 + 90) = 10%, so the job succeeds only if `max_filter_ratio` is at least 0.1.
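As an illustration, a Broker Load job that tolerates up to 10% Filtered Rows could set the threshold in its PROPERTIES; this is a sketch with a hypothetical label:

```sql
LOAD LABEL example_db.label_tolerant_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
)
WITH BROKER bos ( ... )
PROPERTIES
(
    -- Abort the job only if more than 10% of rows are Filtered Rows;
    -- Unselected Rows do not count toward this ratio.
    "max_filter_ratio" = "0.1"
);
```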