
Load JSON Data

Palo supports loading data in JSON format. This document describes the considerations for loading JSON-formatted data.

Supported load methods

Currently, only the following load methods support loading data in JSON format:

  • Load a local file in JSON format through STREAM LOAD.
  • Subscribe to and consume JSON-format messages in Kafka through ROUTINE LOAD.

Other load methods do not support JSON-format data for the time being.

Supported Json formats

Currently, only the following two Json formats are supported:

  1. Multi-row data represented by Array

    Json format with Array as root node. Each element in Array represents a row of data to be loaded, which is usually an Object. Examples are as follows:

    [
        { "id": 123, "city" : "beijing"},
        { "id": 456, "city" : "shanghai"},
        ...
    ]
    [
        { "id": 123, "city" : { "name" : "beijing", "region" : "haidian"}},
        { "id": 456, "city" : { "name" : "beijing", "region" : "chaoyang"}},
        ...
    ]

    This method is usually used in Stream Load mode to represent multiple rows of data in a batch of loaded data.

    The use of this method must be accompanied by the setting strip_outer_array=true. During parsing, Palo will expand the array and then parse each Object as a row of data in turn.

  2. Single row data represented by Object

    Json format with Object as root node. The entire Object represents a row of data to be loaded. Examples are as follows:

    { "id": 123, "city" : "beijing"}
    { "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}

    This method is usually used with Routine Load, for example, to represent one message in Kafka, i.e., one row of data.
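The array expansion described for format 1 can be sketched in Python. This is an illustration of the semantics only, not Palo's actual implementation:

```python
import json

# With strip_outer_array=true, the outer Array is expanded and each
# element (an Object) is parsed as one row of data.
doc = '''[
    { "id": 123, "city" : "beijing"},
    { "id": 456, "city" : "shanghai"}
]'''

rows = json.loads(doc)                         # root node must be an Array
parsed = [(r["id"], r["city"]) for r in rows]  # one tuple per row
```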

fuzzy_parse parameter

In STREAM LOAD, the user can add the fuzzy_parse parameter to speed up JSON data loading.

This parameter is usually used for loading multi-row data represented by an Array, so it is usually combined with strip_outer_array=true.

This function requires that the field order be exactly the same in every data row of the Array. Palo will parse fields according to the field order of the first row only, and then access subsequent rows by subscript. This method can improve load efficiency by 3-5 times.
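The constraint can be sketched as follows (an illustration only, not Palo's code): the key order is recorded from the first row, and later rows are read positionally, which is only correct when every row lists its fields in the same order.

```python
import json

data = '''[
    {"id": 100, "city": "beijing"},
    {"id": 101, "city": "shanghai"}
]'''
rows = json.loads(data)

# Field order is taken from the first row only.
field_order = list(rows[0].keys())

# Subsequent rows are read positionally; a row with a different field
# order would silently end up in the wrong columns.
table = [list(r.values()) for r in rows]
```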

Json Path

Palo supports extracting specified data in Json through Json path.

Note: for Array type data, Palo will first expand the array and then process each row according to the Object format. Therefore, the examples in the rest of this document are illustrated with Json data in single-Object format.

  • Json Path is not specified

    If no Json Path is specified, Palo will by default look up elements in the Object using the column names of the table. Examples are as follows:

    The table contains two columns: id, city

    Json data are as follows:

    { "id": 123, "city" : "beijing"}

    Then Palo will match with id, city to get the final data 123 and beijing.

    If Json data are as follows:

    { "id": 123, "name" : "beijing"}

    Then Palo will match with id, city to get the final data 123 and null.

  • Json Path is specified

    Specify a set of Json Paths in the form of a Json array. Each element in the array represents a column to be extracted. Examples are as follows:

    ["$.id", "$.name"]
    ["$.id.sub_id", "$.name[0]", "$.city[0]"]

    Palo will use the specified Json Path for data matching and extraction.

  • Match non-basic types

    The final matching values in the previous examples are all basic types, such as integer, string, etc. Palo currently does not support composite types like Array, Map, etc. So when matching a non-basic type, Palo will convert the type to a string in Json format and load it as a string type. Examples are as follows:

    Json data are:

    { "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}

    Json Path is ["$.city"], then the matched elements are:

    { "name" : "beijing", "region" : "haidian" }

    The element will be converted to a string for subsequent load:

    "{'name':'beijing','region':'haidian'}"
  • Match failed

    When the match fails, the value returned will be null. Examples are as follows:

    Json data are:

    { "id": 123, "name" : "beijing"}

    Json Path is ["$.id", "$.info"], then the matched elements are 123 and null.

    Palo currently does not distinguish between null values expressed in the Json data and nulls produced by a failed match. Suppose that the Json data are:

    { "id": 123, "name" : null }

    then the same results, 123 and null, will be obtained with either of the following two Json Paths.

    ["$.id", "$.name"]
    ["$.id", "$.info"]
  • Complete failed match

    To prevent mistakes caused by incorrect parameter settings, when Palo tries to match a row of data and all of the columns fail to match, the row will be treated as an error row. Suppose that the Json data are:

    { "id": 123, "city" : "beijing" }

    If Json Path is incorrectly written as follows (or if Json Path is not specified and the table's columns contain neither id nor city):

    ["$.ad", "$.infa"]

    then the match fails completely, and the row will be marked as an error row instead of producing null, null.
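The matching rules above (a partial match yields null, a complete failure yields an error row) can be sketched with a toy matcher for flat "$.field" paths. This is an illustration only; Palo's real Json Path support is richer.

```python
import json

def extract(obj, paths):
    """Toy matcher for flat "$.field" paths (illustration only)."""
    values = [obj.get(p[len("$."):]) for p in paths]
    # When every column fails to match, the row is an error row.
    if all(v is None for v in values):
        raise ValueError("error row: no column matched")
    return values

row = json.loads('{"id": 123, "name": "beijing"}')

partial = extract(row, ["$.id", "$.info"])   # one field missing -> null

try:                                          # all fields missing -> error row
    extract(row, ["$.ad", "$.infa"])
    error_row = False
except ValueError:
    error_row = True
```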

Json Path and Columns

Json Path is used to specify how to extract data in Json format, while Columns specifies the mapping and converting relationship of columns. The two can be used in combination.

In other words, Json Path rearranges the columns of the Json data according to the column order specified in Json Path, and Columns then maps the rearranged source data to the columns of the table. Examples are as follows:

Data contents:

{"k1" : 1, "k2": 2}

Table structure:

k2 int, k1 int

load statement 1 (taking Stream Load as an example):

curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load

In load statement 1, only Json Path is specified and Columns is not. Json Path extracts the Json data in the field order given in Json Path, and the values are then written in the order of the table structure. The final result of the loaded data is as follows:

+------+------+
| k1   | k2   |
+------+------+
|    2 |    1 |
+------+------+

It can be seen that column k1 is actually loaded with the value of the "k2" field in the Json data. The reason is that the field names in the Json differ from those in the table structure, so we need to explicitly specify the mapping relationship between the two.

load statement 2:

curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, k1" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load

Compared with load statement 1, a Columns field is added here to give the column mapping in the order k2, k1. This means that after the data are extracted in the field order of Json Path, the first value is assigned to column k2 of the table and the second value to column k1. The final result of the loaded data is as follows:

+------+------+
| k1   | k2   |
+------+------+
|    1 |    2 |
+------+------+

Of course, column conversions can be performed in Columns as in other load methods. Examples are as follows:

curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, tmp_k1, k1 = tmp_k1 * 100" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load

The above example loads the value of k1 after multiplying it by 100. The final result of the loaded data is as follows:

+------+------+
| k1   | k2   |
+------+------+
|  100 |    2 |
+------+------+
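The extraction-then-derivation above can be sketched in Python (illustration only; the values 2 and 1 are the result of applying ["$.k2", "$.k1"] to the data {"k1" : 1, "k2": 2}):

```python
# "columns: k2, tmp_k1, k1 = tmp_k1 * 100" binds the extracted values,
# in Json Path order, to k2 and tmp_k1, then derives k1 from tmp_k1.
extracted = [2, 1]        # result of ["$.k2", "$.k1"] on the sample row
k2, tmp_k1 = extracted
k1 = tmp_k1 * 100
```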

The values of NULL and Default

The example data are as follows:

[
    {"k1": 1, "k2": "a"},
    {"k1": 2},
    {"k1": 3, "k2": "c"}
]

The table structure is: k1 int null, k2 varchar(32) null default "x"

load statement is as follows:

curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load

The user may expect the following load result, where the missing column is filled with its default value:

+------+------+
| k1   | k2   |
+------+------+
|    1 |    a |
+------+------+
|    2 |    x |
+------+------+
|    3 |    c |
+------+------+

The actual load result, however, is as follows: NULL is filled in for the missing column.

+------+------+
| k1   | k2   |
+------+------+
|    1 |    a |
+------+------+
|    2 | NULL |
+------+------+
|    3 |    c |
+------+------+

The reason is that, from the information in the load statement alone, Palo cannot know that "the missing column is column k2 of the table". If the user wants to load the above data with the expected result, the load statement should be as follows:

curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -H "jsonpaths: [\"$.k1\", \"$.k2\"]" -H "columns: k1, tmp_k2, k2 = ifnull(tmp_k2, 'x')" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
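What the ifnull expression in this statement computes can be sketched in Python (illustration only): a missing or null tmp_k2 is replaced by the default 'x'.

```python
# Sketch of "columns: k1, tmp_k2, k2 = ifnull(tmp_k2, 'x')" applied to
# the three sample rows; the second row lacks k2 entirely.
rows = [{"k1": 1, "k2": "a"}, {"k1": 2}, {"k1": 3, "k2": "c"}]
loaded = [(r["k1"], r["k2"] if r.get("k2") is not None else "x")
          for r in rows]
```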

Application examples

Stream Load

Because the Json format cannot be split, when loading a Json-format file with Stream Load, processing will start only after the whole file content has been read into memory. Therefore, if the file is too large, it may consume a large amount of memory.
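One client-side workaround (an assumption on the user's side, not a Palo feature) is to split a large Json array into smaller arrays and submit each one as a separate Stream Load, so no single request holds the whole file in memory:

```python
import json

def split_batches(rows, batch_size):
    """Yield the rows as a series of smaller Json array documents."""
    for i in range(0, len(rows), batch_size):
        yield json.dumps(rows[i:i + batch_size])

rows = [{"id": i} for i in range(5)]
batches = list(split_batches(rows, 2))   # 3 requests of <= 2 rows each
```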

Suppose the table structure is:

id      INT     NOT NULL,
city    VARCHAR NULL,
code    INT     NULL
  1. load single row data 1

    {"id": 100, "city": "beijing", "code" : 1}
    • Json Path is not specified

      curl --location-trusted -u user:passwd -H "format: json" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load

      load result:

      100     beijing     1
    • Json Path is specified

      curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load

      load result:

      100     beijing     1
  2. load single row data 2

    {"id": 100, "content": {"city": "beijing", "code" : 1}}
    • Json Path is specified
    curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.content.city\",\"$.content.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load

    load result:

    100     beijing     1
  3. load multi row data

    [
        {"id": 100, "city": "beijing", "code" : 1},
        {"id": 101, "city": "shanghai"},
        {"id": 102, "city": "tianjin", "code" : 3},
        {"id": 103, "city": "chongqing", "code" : 4},
        {"id": 104, "city": ["zhejiang", "guangzhou"], "code" : 5},
        {
            "id": 105,
            "city": {
                "order1": ["guangzhou"]
            }, 
            "code" : 6
        }
    ]
    • Json Path is specified

      curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load

      load result:

      100     beijing                     1
      101     shanghai                    NULL
      102     tianjin                     3
      103     chongqing                   4
      104     ["zhejiang","guangzhou"]    5
      105     {"order1":["guangzhou"]}    6
  4. Convert loaded data

    The data are still the multi-row data of example 3; now the user needs to add 1 to the code column of the loaded data.

    curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -H "columns: id, city, tmpc, code=tmpc+1" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load

    load result:

    100     beijing                     2
    101     shanghai                    NULL
    102     tianjin                     4
    103     chongqing                   5
    104     ["zhejiang","guangzhou"]    6
    105     {"order1":["guangzhou"]}    7

Routine Load

The principle of processing Json data in Routine Load is the same as in Stream Load and will not be repeated here.

For Kafka data sources, the content of each Message is regarded as one complete piece of Json data. If a Message contains multiple rows of data in Array format, all of those rows are loaded, but the Kafka offset only increases by 1. Conversely, if an Array-format Json represents multiple rows but fails to parse because the Json is malformed, the error count only increases by 1 (since parsing failed, Palo cannot tell how many rows the Message contained and can only record one error row).
