Load JSON Data
Palo supports loading data in JSON format. This document mainly describes the precautions to take when loading data in JSON format.
Supported load methods
Currently, only the following load methods support loading data in JSON format:
- Load local files in JSON format through STREAM LOAD.
- Subscribe to and consume messages in JSON format from Kafka through ROUTINE LOAD.
Other load methods do not support data in JSON format for the time being.
Supported JSON formats
Currently, only the following two JSON formats are supported:
- Multi-row data represented by Array

JSON format with Array as the root node. Each element in the Array represents a row of data to be loaded, and is usually an Object. Examples:

[ { "id": 123, "city" : "beijing"}, { "id": 456, "city" : "shanghai"}, ... ]

[ { "id": 123, "city" : { "name" : "beijing", "region" : "haidian"}}, { "id": 456, "city" : { "name" : "beijing", "region" : "chaoyang"}}, ... ]

This method is usually used in Stream Load to represent multiple rows of data in a batch. It must be used together with the setting strip_outer_array=true (see the sketch after this list). During parsing, Palo will expand the array and then parse each Object as a row of data in turn.
- Single row data represented by Object

JSON format with Object as the root node. The entire Object represents a row of data to be loaded. Examples:

{ "id": 123, "city" : "beijing"}

{ "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}

This method is usually used with Routine Load, for example to represent one message in Kafka, i.e., one row of data.
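For example, a minimal Stream Load invocation for the Array format might look like the following sketch; user:passwd, db1, tbl1, and data.json are placeholder names:
curl --location-trusted -u user:passwd -H "format: json" -H "strip_outer_array: true" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load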
fuzzy_parse parameter
In STREAM LOAD, the user can add the fuzzy_parse parameter to speed up the loading of JSON data.
This parameter is usually used for loading multi-row data represented by Array, so it is usually combined with strip_outer_array=true.
This feature requires that the field order be exactly the same in every data row of the Array. Palo will parse only according to the field order of the first row, and then access subsequent rows by subscript. This method can improve load efficiency by 3 to 5 times.
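Assuming fuzzy_parse is passed as a header like the other JSON options in this document, a combined invocation might look like this sketch (placeholder names as before):
curl --location-trusted -u user:passwd -H "format: json" -H "strip_outer_array: true" -H "fuzzy_parse: true" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load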
Json Path
Palo supports extracting specified data from JSON through Json Path.
Note: for Array-type data, Palo will first expand the array and then process each row according to the Object format. Therefore, the examples below are illustrated with JSON data in single-Object format.
- Json Path is not specified

If no Json Path is specified, Palo will by default search the elements of the Object using the column names of the table. Examples:

The table contains two columns: id, city

The JSON data are:

{ "id": 123, "city" : "beijing"}

Then Palo will match with id and city, and get the final values 123 and beijing.

If the JSON data are:

{ "id": 123, "name" : "beijing"}

Then Palo will match with id and city, and get the final values 123 and null.
- Json Path is specified

Specify a set of Json Paths in the form of a JSON array; each element in the array represents a column to be extracted. Examples:
["$.id", "$.name"]
["$.id.sub_id", "$.name[0]", "$.city[0]"]
Palo will use the specified Json Path for data matching and extraction.
- Match non-basic types

The values matched in the previous examples are all basic types, such as integer, string, etc. Palo currently does not support composite types such as Array, Map, etc. So when a non-basic type is matched, Palo will convert it to a string in JSON format and load it as a string. Examples:

The JSON data are:

{ "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}

The Json Path is ["$.city"], so the matched element is:

{ "name" : "beijing", "region" : "haidian" }

The element will be converted to a string for the subsequent load:

"{'name':'beijing','region':'haidian'}"
- Match failed

When the match fails, null is returned. Examples:

The JSON data are:

{ "id": 123, "name" : "beijing"}

The Json Path is ["$.id", "$.info"], so the matched elements are 123 and null.

Palo currently does not distinguish between null values expressed in the JSON data and null values produced by a failed match. Suppose the JSON data are:

{ "id": 123, "name" : null }

Then the same results, 123 and null, will be obtained with either of the following two Json Paths:

["$.id", "$.name"]

["$.id", "$.info"]
- Complete failed match

To prevent misloading caused by incorrect parameter settings, when Palo tries to match a row of data and all the columns fail to match, the row is considered an error row. Suppose the JSON data are:

{ "id": 123, "city" : "beijing" }

If the Json Path is incorrectly written as the following (or, when Json Path is not specified, the columns of the table contain neither id nor city):

["$.ad", "$.infa"]

then the match fails completely, and the row is marked as an error row instead of producing null, null.
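As a concrete sketch of the non-basic-type case above, the nested city Object can be extracted into a string column like this; db1, tbl1, user:passwd, and data.json are placeholders, and the target city column is assumed to be a VARCHAR wide enough to hold the serialized Object:
curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\", \"$.city\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load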
Json Path and Columns
Json Path is used to specify how to extract data in Json format, while Columns specifies the mapping and converting relationship of columns. The two can be used in combination.
In other words, Json Path first rearranges the data in the JSON according to the column order given in Json Path, and Columns then maps the rearranged source data to the columns of the table. Examples:
Data contents:
{"k1" : 1, "k2": 2}
Table structure:
k2 int, k1 int
Load statement 1 (taking Stream Load as an example):
curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
In load statement 1, only Json Path is specified and Columns is not. Json Path extracts the JSON data in the field order given in Json Path, and the extracted values are then written in the order of the table structure. The final loaded result is as follows:
+------+------+
| k1 | k2 |
+------+------+
| 2 | 1 |
+------+------+
It can be seen that the k1 column is actually loaded with the value of the "k2" field in the JSON data. The reason is that the field names in the JSON differ from those in the table structure, so the mapping between the two must be specified explicitly.
Load statement 2:
curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, k1" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
Compared with load statement 1, the Columns field is added here to describe the mapping of columns, in the order k2, k1. This means that after the data are extracted in the order of the fields in Json Path, the value of the first column is taken as that of column k2 in the table, and the value of the second column as that of column k1. The final loaded result is as follows:
+------+------+
| k1 | k2 |
+------+------+
| 1 | 2 |
+------+------+
Of course, as in other load methods, column conversions can be performed in Columns. An example:
curl -v --location-trusted -u root: -H "format: json" -H "jsonpaths: [\"$.k2\", \"$.k1\"]" -H "columns: k2, tmp_k1, k1 = tmp_k1 * 100" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
The above example multiplies the value of k1 by 100 before loading it. The final loaded result is as follows:
+------+------+
| k1 | k2 |
+------+------+
| 100 | 2 |
+------+------+
The values of NULL and Default
The example data are as follows:
[
    {"k1": 1, "k2": "a"},
    {"k1": 2},
    {"k1": 3, "k2": "c"}
]
The table structure is: k1 int null, k2 varchar(32) null default "x"
load statement is as follows:
curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
The user may expect the following load result, where the default value is filled in for the missing column:
+------+------+
| k1   | k2   |
+------+------+
| 1    | a    |
| 2    | x    |
| 3    | c    |
+------+------+
The actual load result, however, is as follows; that is, NULL is filled in for the missing column:
+------+------+
| k1   | k2   |
+------+------+
| 1    | a    |
| 2    | NULL |
| 3    | c    |
+------+------+
The reason is that Palo cannot tell from the load statement that "the missing column is column k2 of the table". To load the above data with the expected result, the load statement should be as follows:
curl -v --location-trusted -u root: -H "format: json" -H "strip_outer_array: true" -H "jsonpaths: [\"$.k1\", \"$.k2\"]" -H "columns: k1, tmp_k2, k2 = ifnull(tmp_k2, 'x')" -T example.json http://127.0.0.1:8030/api/db1/tbl1/_stream_load
Application examples
Stream Load
Because the JSON format cannot be split, when a JSON file is loaded with Stream Load, the whole file content is read into memory before processing begins. Therefore, loading a large file may consume a lot of memory.
Suppose the table structure is:
id INT NOT NULL,
city VARCHAR NULL,
code INT NULL
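For reference, a minimal DDL sketch matching this structure might look as follows; the database and table names, VARCHAR length, key model, bucket count, and replication number are illustrative assumptions:
CREATE TABLE db1.tbl1 (
    id INT NOT NULL,
    city VARCHAR(64) NULL,
    code INT NULL
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");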
- load single row data 1
{"id": 100, "city": "beijing", "code" : 1}
- Json Path is not specified
curl --location-trusted -u user:passwd -H "format: json" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
load result:
100 beijing 1
- Json Path is specified
curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
load result:
100 beijing 1
- load single row data 2
{"id": 100, "content": {"city": "beijing", "code" : 1}}
- Json Path is specified
curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.content.city\",\"$.content.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
load result:
100 beijing 1
- load multi row data
[ {"id": 100, "city": "beijing", "code" : 1}, {"id": 101, "city": "shanghai"}, {"id": 102, "city": "tianjin", "code" : 3}, {"id": 103, "city": "chongqing", "code" : 4}, {"id": 104, "city": ["zhejiang", "guangzhou"], "code" : 5}, { "id": 105, "city": { "order1": ["guangzhou"] }, "code" : 6 } ]
- Json Path is specified
curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
load result:
100     beijing                     1
101     shanghai                    NULL
102     tianjin                     3
103     chongqing                   4
104     ["zhejiang","guangzhou"]    5
105     {"order1":["guangzhou"]}    6
- Convert loaded data

The data are still the multi-row data from the example above. Now the user needs to add 1 to the code column before loading:

curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -H "columns: id, city, tmpc, code=tmpc+1" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
load result:
100     beijing                     2
101     shanghai                    NULL
102     tianjin                     4
103     chongqing                   5
104     ["zhejiang","guangzhou"]    6
105     {"order1":["guangzhou"]}    7
Routine Load
The principle of processing JSON data in Routine Load is the same as in Stream Load, so it is not repeated here.
For Kafka data sources, the content of each message is regarded as one complete piece of JSON data. If a message contains multiple rows of data in Array format, all of those rows are loaded, but the Kafka offset is increased by only 1. If, however, parsing of an Array-format JSON fails because the JSON is malformed, the error-row count is also increased by only 1: since parsing failed, Palo cannot tell how many rows of data the message actually contains, so it can only record one row of error data.
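A Routine Load job for JSON messages might be created with a sketch like the following, reusing the table from the Stream Load examples; the job name, broker list, and topic are placeholder assumptions:
CREATE ROUTINE LOAD db1.example_json_job ON tbl1
COLUMNS(id, city, code)
PROPERTIES
(
    "format" = "json",
    "jsonpaths" = "[\"$.id\",\"$.city\",\"$.code\"]",
    "strip_outer_array" = "true"
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "example_topic"
);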