# Column mapping, converting and filtering
Palo supports rich column mapping, converting, and filtering operations, which allow flexible preprocessing of the raw data being loaded. This document describes how to use these features when loading data.
## General introduction
Palo processes data in a load job in the following steps:

- The data in the source file are read in, in the original column order.
- The raw data are filtered for the first time by the preceding filter condition (PRECEDING FILTER).
- Column mapping and converting map the raw data to the target column order.
- The converted data are filtered a second time by the post filter condition (WHERE).
- The final data are written.
Column mapping, converting, and filtering parameters are all optional in a load job. If they are not specified, Palo splits each row of the source file by the default column separator `\t` and maps the resulting fields to the table columns in order. If the number of columns in the source file does not match the number of columns in the table, data quality errors occur and the data cannot be loaded. In that case, the column mapping, converting, and filtering information must be described explicitly.
## Supported load methods

- BROKER LOAD

```sql
LOAD LABEL example_db.label1
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    (k1, k2, tmpk3)
    PRECEDING FILTER k1 = 1
    SET (
        k3 = tmpk3 + 1
    )
    WHERE k1 > k2
)
WITH BROKER bos
(
    ...
);
```

- STREAM LOAD

```bash
curl --location-trusted -u user:passwd \
    -H "columns: k1, k2, tmpk3, k3 = tmpk3 + 1" \
    -H "where: k1 > k2" \
    -T file.txt \
    http://host:port/api/testDb/testTbl/_stream_load
```

- ROUTINE LOAD

```sql
CREATE ROUTINE LOAD example_db.label1 ON my_table
COLUMNS(k1, k2, tmpk3, k3 = tmpk3 + 1),
PRECEDING FILTER k1 = 1,
WHERE k1 > k2
...
```
All of the above load methods support column mapping, converting, and filtering of source data:

- Preceding filter: filter the raw data that have just been read, before any other processing.

  `PRECEDING FILTER k1 = 1`

- Mapping: define the columns of the source data. If a defined column name is the same as a column in the table, it is mapped directly to that table column. If they differ, the defined column can be used in subsequent converting. As in the example above:

  `(k1, k2, tmpk3)`

- Converting: convert the mapped columns from the previous step using built-in expressions, functions, or user-defined functions, and remap the results to the corresponding columns in the table. As in the example above:

  `k3 = tmpk3 + 1`

- Post filter: filter the mapped and converted columns with expressions. Filtered rows are not loaded into the system. As in the example above:

  `WHERE k1 > k2`
## Column mapping
Column mapping describes each column in the file being loaded; it is equivalent to defining the column names of the source data. By describing the mapping relationship, source files whose column order or column count differs from the table can still be loaded into Palo. Here are some examples.
Suppose the source file has 4 columns, as shown below (the header names are only for convenience; the file actually has no header):

| Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|
| 1 | 100 | beijing | 1.1 |
| 2 | 200 | shanghai | 1.2 |
| 3 | 300 | guangzhou | 1.3 |
| 4 | \N | chongqing | 1.4 |
Note: `\N` means `null` in the source file.
- Adjust mapping order

  Suppose the table has 4 columns `k1, k2, k3, k4`. The expected load mapping relationship is as follows:

  ```text
  Column 1 -> k1
  Column 2 -> k3
  Column 3 -> k2
  Column 4 -> k4
  ```

  Then the column mapping should be written as:

  `(k1, k3, k2, k4)`
- There are more columns in the source file than in the table

  Suppose the table has 3 columns `k1, k2, k3`. The expected load mapping relationship is as follows:

  ```text
  Column 1 -> k1
  Column 2 -> k3
  Column 3 -> k2
  ```

  Then the column mapping should be written as:

  `(k1, k3, k2, tmpk4)`

  Here `tmpk4` is a self-defined column name that does not exist in the table, so Palo ignores that column. (A Broker Load sketch using a mapping of this kind appears after this list.)
- There are fewer columns in the source file than in the table; fill the rest with default values

  Suppose the table has 5 columns `k1, k2, k3, k4, k5`. The expected load mapping relationship is as follows:

  ```text
  Column 1 -> k1
  Column 2 -> k3
  Column 3 -> k2
  ```

  Here only the first 3 columns of the source file are used, while columns `k4` and `k5` are expected to be filled with default values. The column mapping should be written as:

  `(k1, k3, k2)`

  If columns `k4` and `k5` have default values, those defaults are filled in. Otherwise, if a column is `nullable`, `null` is filled in. Otherwise, the load job reports an error.
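As a concrete illustration of the extra-column case above, here is a minimal Broker Load sketch. It reuses the bos path, broker, and `my_table` from the earlier example; the label name is hypothetical:

```sql
LOAD LABEL example_db.label_mapping_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    -- Name all 4 source columns; tmpk4 does not exist in the table
    -- and is therefore ignored.
    (k1, k3, k2, tmpk4)
)
WITH BROKER bos ( ... );
```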
## Column preceding filter

Preceding filter performs the first filtering pass on the raw data that have just been read. Currently, only BROKER LOAD and ROUTINE LOAD support it.
Preceding filter is applied in the following scenarios:

- Filtering before converting

  When rows should be dropped before column mapping and converting, a preceding filter can remove the unneeded data first.

- The filter column does not exist in the table and serves only as a filter flag

  For example, the source data contain the data of multiple tables (or the data of multiple tables are written to the same Kafka message queue). Each row carries a column value identifying which table the row belongs to. Users can use a preceding filter condition to select only the rows of the target table for loading. (A Broker Load sketch of this scenario follows this list.)
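For the second scenario, a minimal Broker Load sketch might look as follows. The column name `table_id`, the filter value `1`, and the label are hypothetical; the bos path and broker follow the earlier example:

```sql
LOAD LABEL example_db.label_preceding_filter_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    -- table_id only identifies which table a source row belongs to;
    -- it is not a column of my_table, so Palo ignores it.
    (table_id, k1, k2, k3)
    -- Keep only the rows destined for this table.
    PRECEDING FILTER table_id = 1
)
WITH BROKER bos ( ... );
```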
## Column converting

Column converting allows users to transform column values from the source file. Palo currently supports most built-in functions as well as user-defined functions for converting.

Note: a user-defined function belongs to a particular database, so the user needs read permission on that database to use it for converting.

Converting is usually defined together with column mapping: the columns are mapped first and then converted. Here are some examples.
Suppose the source file has 4 columns, as shown below (the header names are only for convenience; the file actually has no header):

| Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|
| 1 | 100 | beijing | 1.1 |
| 2 | 200 | shanghai | 1.2 |
| 3 | 300 | guangzhou | 1.3 |
| \N | 400 | chongqing | 1.4 |
- Load column values from the source file into the table after converting

  Suppose the table has 4 columns `k1, k2, k3, k4`. The expected mapping and converting relationships are as follows:

  ```text
  Column 1       -> k1
  Column 2 * 100 -> k3
  Column 3       -> k2
  Column 4       -> k4
  ```

  Then the column mapping should be written as:

  `(k1, tmpk3, k2, k4, k3 = tmpk3 * 100)`

  This is equivalent to naming the second column of the source file `tmpk3` and specifying that the value of table column `k3` is `tmpk3 * 100`. The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 1 | beijing | 10000 | 1.1 |
  | 2 | shanghai | 20000 | 1.2 |
  | 3 | guangzhou | 30000 | 1.3 |
  | null | chongqing | 40000 | 1.4 |
- Conditional converting with the case when function

  Suppose the table has 4 columns `k1, k2, k3, k4`. We want `beijing, shanghai, guangzhou, chongqing` in the source data to be converted to the corresponding region ids before loading:

  ```text
  Column 1 -> k1
  Column 2 -> k2
  Column 3, converted to a region id -> k3
  Column 4 -> k4
  ```

  Then the column mapping should be written as:

  `(k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)`

  The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 1 | 100 | 1 | 1.1 |
  | 2 | 200 | 2 | 1.2 |
  | 3 | 300 | 3 | 1.3 |
  | null | 400 | 4 | 1.4 |
- Convert null values in the source file to 0, combined with the region id converting of example 2

  Suppose the table has 4 columns `k1, k2, k3, k4`. While converting the region id, we also want null values in column `k1` of the source data to become 0:

  ```text
  Column 1, converted to 0 if null -> k1
  Column 2 -> k2
  Column 3 -> k3
  Column 4 -> k4
  ```

  Then the column mapping should be written as:

  `(tmpk1, k2, tmpk3, k4, k1 = ifnull(tmpk1, 0), k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)`

  The final data in the table are as follows (a Broker Load sketch of this converting appears after this list):

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 1 | 100 | 1 | 1.1 |
  | 2 | 200 | 2 | 1.2 |
  | 3 | 300 | 3 | 1.3 |
  | 0 | 400 | 4 | 1.4 |
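In Broker Load syntax, conversions are written in a SET clause (as in the first example of this document). Example 3 might then be sketched as follows; the label name is hypothetical:

```sql
LOAD LABEL example_db.label_convert_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    -- Map the raw columns; tmpk1 and tmpk3 are temporary names.
    (tmpk1, k2, tmpk3, k4)
    SET (
        -- Convert null in the first column to 0.
        k1 = ifnull(tmpk1, 0),
        -- Convert the province name to a region id.
        k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2
             when "guangzhou" then 3 when "chongqing" then 4 else null end
    )
)
WITH BROKER bos ( ... );
```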
## Column filter

After column mapping and converting, rows that we do not want to load into Palo can be filtered out with filter conditions. Here are some examples.
Suppose the source file has 4 columns, as shown below (the header names are only for convenience; the file actually has no header):

| Column 1 | Column 2 | Column 3 | Column 4 |
|---|---|---|---|
| 1 | 100 | beijing | 1.1 |
| 2 | 200 | shanghai | 1.2 |
| 3 | 300 | guangzhou | 1.3 |
| \N | 400 | chongqing | 1.4 |
- Filter directly, with column mapping and converting left at their defaults

  Suppose the table has 4 columns `k1, k2, k3, k4`. Filter conditions can be defined directly while keeping the default column mapping and converting. To load only the rows whose fourth column is greater than 1.2, the filter condition is:

  `where k4 > 1.2`

  The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 3 | 300 | guangzhou | 1.3 |
  | null | 400 | chongqing | 1.4 |

  By default, Palo maps columns in order, so the fourth column of the source file is automatically mapped to table column `k4`.
- Filter the data after column converting

  Suppose the table has 4 columns `k1, k2, k3, k4`. In the column converting example we converted the province names to ids. Here we want to filter out the rows whose id is 3. The converting and filter conditions are:

  `(k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)`

  `where k3 != 3`

  The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 1 | 100 | 1 | 1.1 |
  | 2 | 200 | 2 | 1.2 |
  | null | 400 | 4 | 1.4 |

  Note that the column values the filter sees are the final values after mapping and converting, not the original source values. (A Broker Load sketch of this example appears after this list.)
- Multi-condition filtering

  Suppose the table has 4 columns `k1, k2, k3, k4`. To filter out the rows whose column `k1` is `null` as well as the rows whose column `k4` is less than 1.2, note that the `where` condition describes the rows to keep, so the filter condition is:

  `where k1 is not null and k4 >= 1.2`

  The final data in the table are as follows:

  | k1 | k2 | k3 | k4 |
  |---|---|---|---|
  | 2 | 200 | 2 | 1.2 |
  | 3 | 300 | 3 | 1.3 |
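Written as a Broker Load sketch, example 2 above might look like this (hypothetical label; the conversion goes in SET and the post filter in WHERE, following the first example of this document):

```sql
LOAD LABEL example_db.label_filter_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
    (k1, k2, tmpk3, k4)
    SET (
        k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2
             when "guangzhou" then 3 when "chongqing" then 4 else null end
    )
    -- The filter sees the converted value of k3, not the raw city name.
    WHERE k3 != 3
)
WITH BROKER bos ( ... );
```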
## Data quality issues and filter thresholds

The rows processed in a load job fall into three types:

- Filtered Rows

  Rows filtered out because of unqualified data quality. Quality problems include type errors, precision errors, over-long strings, a mismatched number of file columns, and rows filtered out because no corresponding partition exists for them.

- Unselected Rows

  Rows filtered out by the `preceding filter` or `where` conditions.

- Loaded Rows

  Rows that are loaded correctly.
Palo load jobs allow users to set a maximum error ratio (`max_filter_ratio`). If the error ratio of the loaded data is below this threshold, the error rows are ignored and the remaining correct data are loaded.

The error ratio is calculated as:

`#Filtered Rows / (#Filtered Rows + #Loaded Rows)`

In other words, `Unselected Rows` do not count toward the error ratio. For example, if 10 rows are filtered out for quality problems and 90 rows are loaded correctly, the error ratio is 10 / (10 + 90) = 10%, so the job succeeds only if `max_filter_ratio` is at least 0.1.
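As an illustration, a Broker Load job that tolerates up to 10% Filtered Rows could set the threshold in its PROPERTIES; this is a sketch with a hypothetical label:

```sql
LOAD LABEL example_db.label_tolerant_example
(
    DATA INFILE("bos://bucket/input/file")
    INTO TABLE `my_table`
)
WITH BROKER bos ( ... )
PROPERTIES
(
    -- Abort the job only if more than 10% of rows are Filtered Rows;
    -- Unselected Rows do not count toward this ratio.
    "max_filter_ratio" = "0.1"
);
```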