          Column mapping, converting and filtering

          Palo supports rich column mapping, converting, and filtering operations, which allow flexible processing of the raw data to be loaded.

          This document mainly describes how to load data with these functions.

          General introductions

          The data processing steps of Palo during loading are as follows:

          1. The data are read in, in the column order of the original file.
          2. The original data are filtered for the first time by the preceding filter condition (PRECEDING FILTER).
          3. Through column mapping and converting, the original data are mapped to the target column order.
          4. The converted data are filtered again by the post filter condition (WHERE).
          5. The final data are written.

          Column mapping, converting, and filtering parameters are optional in a load job. If they are not specified, Palo splits each row in the source file by the default column separator \t and maps the resulting fields to the table columns in order. If the number of columns in the source file does not match the number of columns in the table, data quality problems will occur and the data cannot be loaded. In such cases, the column mapping, converting, and filtering information must be described explicitly.
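          For instance, a minimal STREAM LOAD that relies entirely on these defaults simply omits the columns and where headers; the file is then expected to be \t-separated with exactly as many columns as the table. This is a sketch only, reusing the placeholder names (user:passwd, host:port, testDb, testTbl, file.txt) from the examples below:

            curl --location-trusted \
            -u user:passwd \
            -T file.txt \
            http://host:port/api/testDb/testTbl/_stream_load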

          Supported load methods

          • BROKER LOAD

            LOAD LABEL example_db.label1
            (
                DATA INFILE("bos://bucket/input/file")
                INTO TABLE `my_table`
                (k1, k2, tmpk3)
                PRECEDING FILTER k1 = 1
                SET (
                    k3 = tmpk3 + 1
                )
                WHERE k1 > k2
            )
            WITH BROKER bos
            (
                ...
            );
          • STREAM LOAD

            curl --location-trusted \
            -u user:passwd \
            -H "columns: k1, k2, tmpk3, k3 = tmpk3 + 1" \
            -H "where: k1 > k2" \
            -T file.txt \
            http://host:port/api/testDb/testTbl/_stream_load
          • ROUTINE LOAD

            CREATE ROUTINE LOAD example_db.label1 ON my_table
            COLUMNS(k1, k2, tmpk3, k3 = tmpk3 + 1),
            PRECEDING FILTER k1 = 1,
            WHERE k1 > k2
            ...

          All the above load methods support column mapping, converting, and filtering of source data:

          • Preceding filter: Performs a first-pass filter on the original data as they are read.

            PRECEDING FILTER k1 = 1
          • Mapping: Defines the columns in the source data. If a defined column name is the same as a column in the table, it is mapped directly to that table column. If they differ, the defined column can be used in subsequent converting. As in the example above:

            (k1, k2, tmpk3)
          • Converting: Converts the mapped columns using built-in expressions, functions, and user-defined functions, and remaps the results to the corresponding table columns. As in the example above:

            k3 = tmpk3 + 1
          • Post filter: Filters the mapped and converted columns with expressions. Rows that are filtered out will not be loaded into the system. As in the example above:

            WHERE k1 > k2

          Column mapping

          Column mapping describes the information of each column in the loaded file, which is equivalent to naming the columns in the source data. By describing the column mapping relationship, source files whose column order or column count differs from the table can still be loaded into Palo. Here are the examples:

          Suppose the source file contains the following 4 columns (the header names are for convenience only; the file itself has no header):

          Column 1  Column 2  Column 3   Column 4
          1         100       beijing    1.1
          2         200       shanghai   1.2
          3         300       guangzhou  1.3
          4         \N        chongqing  1.4

          Note: \N represents null in the source file.

          1. Adjust mapping order

            Suppose there are 4 columns k1,k2,k3,k4 in the table. The expected load mapping relationship is as follows:

            Column 1 -> k1
            Column 2 -> k3
            Column 3 -> k2
            Column 4 -> k4

            Then the writing order of column mapping should be as follows:

            (k1, k3, k2, k4)
          2. There are more columns in the source file than in the table

            Suppose there are 3 columns k1,k2,k3 in the table. The expected load mapping relationship is as follows:

            Column 1 -> k1
            Column 2 -> k3
            Column 3 -> k2

            Then the writing order of column mapping should be as follows:

            (k1, k3, k2, tmpk4)

            Here tmpk4 is a user-defined column name that does not exist in the table; Palo ignores columns whose names are not present in the table.

          3. If there are fewer columns in the source file than in the table, fill in the missing columns with default values.

            Suppose there are 5 columns k1,k2,k3,k4,k5 in the table. The expected load mapping relationship is as follows:

            Column 1 -> k1
            Column 2 -> k3
            Column 3 -> k2

            Here we only use the first 3 columns in the source file. Columns k4,k5 are expected to be filled in with default values.

            Then the writing order of column mapping should be as follows:

            (k1, k3, k2)

            If columns k4 and k5 have default values, the default values will be filled in. Otherwise, if the column is nullable, the null value will be filled in. Otherwise, the load job will report an error. Concrete load-command sketches for these mappings are shown after this list.
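          As a concrete illustration of the mappings above, and as a minimal sketch only: in STREAM LOAD, the column list from case 1 is passed through the columns header (user:passwd, host:port, testDb, testTbl, and file.txt are placeholder names reused from the earlier example):

            curl --location-trusted \
            -u user:passwd \
            -H "columns: k1, k3, k2, k4" \
            -T file.txt \
            http://host:port/api/testDb/testTbl/_stream_load

          In BROKER LOAD, the column list from case 2 goes after INTO TABLE, as in the BROKER LOAD example earlier (the label, BOS path, and broker parameters are placeholders):

            LOAD LABEL example_db.label2
            (
                DATA INFILE("bos://bucket/input/file")
                INTO TABLE `my_table`
                (k1, k3, k2, tmpk4)
            )
            WITH BROKER bos
            (
                ...
            );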

          Column preceding filter

          The preceding filter performs a first-pass filter on the original data as they are read. At present, only BROKER LOAD and ROUTINE LOAD support it.

          Preceding filter is applied in the following scenarios:

          1. Filter before converting

            For scenarios where data should be filtered before column mapping and converting, so that unneeded rows can be discarded first.

          2. The filter column does not exist in the table and serves only as a filter flag

            For example, the source data contains the data of multiple tables (or the data of multiple tables is written to the same Kafka message queue). Each row carries a label column identifying which table the row belongs to, and users can select the rows of the target table for loading through preceding filter conditions.
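          A minimal sketch of scenario 2 using ROUTINE LOAD, modeled on the ROUTINE LOAD example earlier. Here table_flag is a hypothetical label column that exists in each Kafka message but not in the table, and the trailing ... stands for the remaining job and data source properties:

            CREATE ROUTINE LOAD example_db.label3 ON my_table
            COLUMNS(table_flag, k1, k2, k3),
            PRECEDING FILTER table_flag = "my_table",
            ...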

          Column converting

          Column converting allows users to convert column values in the source file. Currently, Palo supports most built-in functions and user-defined functions for converting.

          Note: A user-defined function belongs to a database, so the user needs read permission on that database in order to use it for converting.

          Converting is usually defined together with column mapping, that is, the columns are mapped first and then converted. Here are the examples:

          Suppose the source file contains the following 4 columns (the header names are for convenience only; the file itself has no header, and \N represents null):

          Column 1  Column 2  Column 3   Column 4
          1         100       beijing    1.1
          2         200       shanghai   1.2
          3         300       guangzhou  1.3
          \N        400       chongqing  1.4
          1. Load the column values from the source file into the table after converting them

            Suppose there are 4 columns k1,k2,k3,k4 in the table. The expected load mapping and converting relationships are as follows:

            Column 1       -> k1
            Column 2 * 100 -> k3
            Column 3       -> k2
            Column 4       -> k4

            Then the writing order of column mapping should be as follows:

            (k1, tmpk3, k2, k4, k3 = tmpk3 * 100)

            This is equivalent to naming the second column in the source file tmpk3 and specifying that the value of column k3 in the table is tmpk3 * 100 (see the BROKER LOAD sketch after this list for the same conversion in SET-clause form). The final data in the table are as follows:

            k1    k2         k3     k4
            1     beijing    10000  1.1
            2     shanghai   20000  1.2
            3     guangzhou  30000  1.3
            null  chongqing  40000  1.4
          2. Conditional column converting can be achieved with the case when expression.

            Suppose there are 4 columns k1,k2,k3,k4 in the table. We want the values beijing, shanghai, guangzhou, and chongqing in the source data to be converted to their corresponding region ids before loading:

            Column 1                  -> k1
            Column 2                  -> k2
            Column 3 after being converted to region id    -> k3
            Column 4                  -> k4

            Then the writing order of column mapping should be as follows:

            (k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)

            The final data in the table are as follows:

            k1    k2   k3  k4
            1     100  1   1.1
            2     200  2   1.2
            3     300  3   1.3
            null  400  4   1.4
          3. Convert the null values in the source file to 0 before loading, while also performing the region id conversion from example 2.

            Suppose there are 4 columns k1,k2,k3,k4 in the table. When converting the region id, we also wish to convert the null values in column k1 to 0 before loading:

            Column 1 If null, convert to 0   -> k1
            Column 2                      -> k2
            Column 3                      -> k3
            Column 4                      -> k4

            Then the writing order of column mapping should be as follows:

            (tmpk1, k2, tmpk3, k4, k1 = ifnull(tmpk1, 0), k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)

            The final data in the table are as follows:

            k1  k2   k3  k4
            1   100  1   1.1
            2   200  2   1.2
            3   300  3   1.3
            0   400  4   1.4
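          The converting examples above can also be expressed in the other load methods. A minimal sketch of example 1 as a BROKER LOAD, where the conversion moves into a SET clause as in the BROKER LOAD example earlier (the label and broker parameters are placeholders):

            LOAD LABEL example_db.label4
            (
                DATA INFILE("bos://bucket/input/file")
                INTO TABLE `my_table`
                (k1, tmpk3, k2, k4)
                SET (
                    k3 = tmpk3 * 100
                )
            )
            WITH BROKER bos
            (
                ...
            );

          And a sketch of example 2 as a STREAM LOAD, where the whole mapping-plus-conversion goes into a single columns header; note the outer single quotes, since the expression itself contains double quotes:

            curl --location-trusted \
            -u user:passwd \
            -H 'columns: k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end' \
            -T file.txt \
            http://host:port/api/testDb/testTbl/_stream_load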

          Column filter

          After column mapping and converting, the data that we do not want to load into Palo can be filtered out through filter conditions. Examples are as follows:

          Suppose the source file contains the following 4 columns (the header names are for convenience only; the file itself has no header, and \N represents null):

          Column 1  Column 2  Column 3   Column 4
          1         100       beijing    1.1
          2         200       shanghai   1.2
          3         300       guangzhou  1.3
          \N        400       chongqing  1.4
          1. Filter directly while leaving column mapping and converting at their defaults

            Suppose there are 4 columns k1,k2,k3,k4 in the table. Filter conditions can be defined directly while column mapping and converting are left at their defaults. To load only the rows whose fourth column is greater than 1.2, the filter condition is as follows:

            where k4 > 1.2

            The final data in the table are as follows:

            k1    k2   k3         k4
            3     300  guangzhou  1.3
            null  400  chongqing  1.4

            By default, Palo maps columns in order, so the fourth column in the source file is automatically mapped to column k4 in the table.

          2. Filter the data after column converting

            Suppose there are 4 columns k1,k2,k3,k4 in the table, and the province names have been converted to ids as in the Column converting example. Here we want to filter out the rows whose id is 3. The converting and filter conditions are as follows:

            (k1, k2, tmpk3, k4, k3 = case tmpk3 when "beijing" then 1 when "shanghai" then 2 when "guangzhou" then 3 when "chongqing" then 4 else null end)
            where k3 != 3

            The final data in the table are as follows:

            k1    k2   k3  k4
            1     100  1   1.1
            2     200  2   1.2
            null  400  4   1.4

            Here we can see that the column values used in filtering are the final values after mapping and converting, not the original data.

          3. Multi-condition filtering

            Suppose there are 4 columns k1,k2,k3,k4 in the table. If we want to filter out the rows whose column k1 is null as well as the rows whose column k4 is less than 1.2, the filter condition (which describes the rows to keep) is as follows:

            where k1 is not null and k4 >= 1.2

            The final data in the table are as follows:

            k1  k2   k3  k4
            2   200  2   1.2
            3   300  3   1.3
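          A minimal sketch of passing such a compound filter in STREAM LOAD, using the where header from the earlier example (placeholder names as before):

            curl --location-trusted \
            -u user:passwd \
            -H "where: k1 is not null and k4 >= 1.2" \
            -T file.txt \
            http://host:port/api/testDb/testTbl/_stream_load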

          Data quality issues and filter thresholds

          There are three types of data rows to be processed in the load job:

          1. Filtered Rows

            Data rows filtered out due to unqualified data quality, including format problems such as type errors, precision errors, string length exceeding the limit, and a mismatched number of file columns, as well as rows filtered out because there is no matching partition for them.

          2. Unselected Rows

            Data rows filtered out by the preceding filter or the where condition.

          3. Loaded Rows

            Data rows that are correctly loaded.

          Palo load jobs allow users to set a maximum error rate (max_filter_ratio). If the error rate of the loaded data is below this threshold, the error rows are ignored and the remaining correct data are loaded.

          The error rate is calculated as:

          #Filtered Rows / (#Filtered Rows + #Loaded Rows)

          In other words, Unselected Rows do not participate in the calculation of the error rate.
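          For example, if a job reads 100 rows of which 5 become Filtered Rows, 10 become Unselected Rows, and 85 become Loaded Rows, the error rate is 5 / (5 + 85) ≈ 5.6%; the job succeeds if max_filter_ratio is set to at least that value (e.g. 0.1) and fails otherwise. A minimal sketch of setting the threshold in STREAM LOAD (via a header) and in BROKER LOAD (via PROPERTIES), reusing the placeholder names from the earlier examples:

            curl --location-trusted \
            -u user:passwd \
            -H "max_filter_ratio: 0.1" \
            -T file.txt \
            http://host:port/api/testDb/testTbl/_stream_load

            LOAD LABEL example_db.label5
            (
                DATA INFILE("bos://bucket/input/file")
                INTO TABLE `my_table`
            )
            WITH BROKER bos
            (
                ...
            )
            PROPERTIES
            (
                "max_filter_ratio" = "0.1"
            );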
