Data Model
This document describes Palo's data model at the logical level, to help users make better use of Palo in different business scenarios.
Basic concepts
In Palo, data is logically described in the form of tables. A table consists of rows and columns: a row is a single record of user data, and a column describes one field of that record.
Columns can be divided into two categories: Key and Value. From a business perspective, Key and Value correspond to dimension columns and metric (indicator) columns respectively.
The data models of Palo fall into three categories:
- Duplicate model (detail model)
- Aggregate model (aggregation model)
- Unique model (unique primary key model)
Next we will introduce them separately.
Duplicate model
The Duplicate (detail) model is the default data model in Palo. It does not process the loaded data in any way; the data stored in the table is exactly the raw data loaded by the user. For example, consider a log table with the following columns:
ColumnName | Type | SortKey | Comment |
---|---|---|---|
timestamp | DATETIME | Yes | Log time |
type | INT | Yes | Log type |
error_code | INT | Yes | Error code |
error_msg | VARCHAR(1024) | No | Error details |
op_id | BIGINT | No | id of person in charge |
op_time | DATETIME | No | Processing time |
The table creation statements are as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`timestamp` DATETIME NOT NULL COMMENT "Log time",
`type` INT NOT NULL COMMENT "Log type",
`error_code` INT COMMENT "Error code",
`error_msg` VARCHAR(1024) COMMENT "Error details",
`op_id` BIGINT COMMENT "id of person in charge",
`op_time` DATETIME COMMENT "Processing time"
)
DUPLICATE KEY(`timestamp`, `type`)
... /* Omit Partition and Distribution information */
;
The DUPLICATE KEY specified in the table creation statement only indicates which columns the underlying data is sorted by. (A more appropriate name would be "Sorted Columns"; the name DUPLICATE KEY is used here merely to make the chosen data model explicit. See the Index document for more on sorted columns.) When choosing the DUPLICATE KEY, we suggest selecting the first 2-4 columns as appropriate.
This data model is suitable for storing raw data that has neither aggregation requirements nor a primary key uniqueness constraint. In addition, users can build aggregated views on top of this model through the materialized view feature (see the sketch below), so it is a recommended data model.
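For instance, the following is a minimal sketch of such an aggregation view, assuming the `CREATE MATERIALIZED VIEW` syntax supported by your Palo version; the view name and the choice of aggregate function are illustrative only:

```sql
-- Hedged sketch: pre-aggregate error counts per log type on top of the detail
-- table above. Verify the materialized view syntax and the supported aggregate
-- functions against your Palo version before using this.
CREATE MATERIALIZED VIEW error_count_per_type AS
SELECT `type`, `error_code`, COUNT(`error_msg`)
FROM example_db.example_tbl
GROUP BY `type`, `error_code`;
```

Queries over the base table that group by `type` and `error_code` can then typically be rewritten to read the smaller, pre-aggregated view.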
Aggregate model
The Aggregate model requires users to explicitly divide columns into Key columns and Value columns when creating a table. The model then automatically aggregates rows with the same Key values on their Value columns.
Let's illustrate with practical examples what the aggregate model is and how to use it correctly.
Example 1: loading data aggregation
Suppose we have the following table schema in the business:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user_id | LARGEINT | | User id |
date | DATE | | Data injection date |
city | VARCHAR(20) | | User's city |
age | SMALLINT | | User's age |
sex | TINYINT | | User's sex |
last_visit_date | DATETIME | REPLACE | Last visit time of user |
cost | BIGINT | SUM | Total consumption of user |
max_dwell_time | INT | MAX | Maximum dwelling time of user |
min_dwell_time | INT | MIN | Minimum dwelling time of user |
This converts into the following table creation statement (partition and distribution information is omitted):
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "User id",
`date` DATE NOT NULL COMMENT "Data injection date",
`city` VARCHAR(20) COMMENT "User's city",
`age` SMALLINT COMMENT "User's age",
`sex` TINYINT COMMENT "User's sex",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last visit time of user",
`cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption of user",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum dwelling time of user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum dwelling time of user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
... /* Omit Partition and Distribution information */
;
As we can see, this is a typical fact table of user information and visit behavior. In a typical star schema, user information and visit behavior would be stored in a dimension table and a fact table respectively; here we store both in one table to make the explanation of the Palo data model more convenient.
The columns in the table are divided into Key (dimension) columns and Value (metric) columns according to whether `AggregationType` is set. Columns without an `AggregationType`, such as `user_id`, `date` and `age`, are Key columns, while columns with an `AggregationType` are Value columns.
When data is loaded, rows with identical Key columns are aggregated into a single row, and each Value column is aggregated according to its `AggregationType`. `AggregationType` currently supports four aggregation methods:
- SUM: sum the Values of multiple rows.
- REPLACE: the Value in the next batch of data replaces the Value in previously loaded rows.
- MAX: keep the maximum value.
- MIN: keep the minimum value.
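Conceptually (this is not how Palo implements it internally), merging rows with identical Keys behaves like a GROUP BY over the Key columns, with each Value column folded by its aggregation type. A rough sketch for the table above, where `loaded_rows` is a hypothetical stand-in for the raw loaded rows:

```sql
-- Conceptual sketch only. REPLACE has no exact ANSI-SQL equivalent, because it
-- keeps the value from the most recently loaded batch; MAX(last_visit_date) is
-- used below purely as a placeholder for it.
SELECT
    user_id, `date`, city, age, sex,            -- Key columns
    MAX(last_visit_date) AS last_visit_date,    -- placeholder for REPLACE
    SUM(cost)            AS cost,               -- SUM
    MAX(max_dwell_time)  AS max_dwell_time,     -- MAX
    MIN(min_dwell_time)  AS min_dwell_time      -- MIN
FROM loaded_rows
GROUP BY user_id, `date`, city, age, sex;
```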
Suppose we have the following loaded data (the original data):
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Suppose that this is a table that records the behaviors of users visiting a product page. Let's take the first row of data as an example to explain:
Data | Instructions |
---|---|
10000 | User id, unique for each user |
2017-10-01 | Data storage date, accurate to the day |
Beijing | User's city |
20 | User's age |
0 | Male (1 for female) |
2017-10-01 06:00:00 | Time the user visited this page, accurate to the second |
20 | Consumption generated by this visit |
10 | Dwell time of this visit to the page |
10 | Dwell time of this visit to the page (redundant) |
After this batch of data is correctly loaded into Palo, the data finally stored in Palo is as follows:
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
We can see that user 10000 has only one row of aggregated data left, while the data of other users are consistent with the original data. Here we first explain the aggregated data of user 10000:
There are no changes in the first 5 columns; the changes start from column 6, `last_visit_date`:
- `2017-10-01 07:00:00`: because the aggregation method of the `last_visit_date` column is REPLACE, `2017-10-01 06:00:00` is replaced by `2017-10-01 07:00:00` and saved. Note: for data within the same load batch, the replacement order of the REPLACE aggregation is not guaranteed, so `2017-10-01 06:00:00` might just as well be the value finally saved in this example. For data from different load batches, it is guaranteed that data in later batches replaces data in earlier batches.
- `35`: because the aggregation type of the `cost` column is SUM, 35 = 20 + 15.
- `10`: because the aggregation type of the `max_dwell_time` column is MAX, 10 is the maximum of 10 and 2.
- `2`: because the aggregation type of the `min_dwell_time` column is MIN, 2 is the minimum of 10 and 2.
After aggregation, only the aggregated data is stored in Palo. In other words, the detail data is lost, and users can no longer query the detail data as it was before aggregation.
Example 2: preserving detail data
Following example 1, the table structure is modified as follows:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user_id | LARGEINT | | User id |
date | DATE | | Data injection date |
timestamp | DATETIME | | Data injection time, accurate to the second |
city | VARCHAR(20) | | User's city |
age | SMALLINT | | User's age |
sex | TINYINT | | User's sex |
last_visit_date | DATETIME | REPLACE | Last visit time |
cost | BIGINT | SUM | Total consumption of user |
max_dwell_time | INT | MAX | Maximum dwelling time of user |
min_dwell_time | INT | MIN | Minimum dwelling time of user |
A `timestamp` column has been added to record the data injection time, accurate to the second; since it has no `AggregationType`, it becomes another Key column. A possible table creation statement for this schema is sketched below.
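The following is a hedged sketch of the corresponding table creation statement, adapted from example 1; the added `timestamp` column simply joins the AGGREGATE KEY, and the partition and distribution information is omitted as before:

```sql
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User id",
    `date` DATE NOT NULL COMMENT "Data injection date",
    `timestamp` DATETIME NOT NULL COMMENT "Data injection time, accurate to the second",
    `city` VARCHAR(20) COMMENT "User's city",
    `age` SMALLINT COMMENT "User's age",
    `sex` TINYINT COMMENT "User's sex",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum dwelling time of user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum dwelling time of user"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
... /* Omit Partition and Distribution information */
;
```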
Loaded data are as follows:
user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
After this batch of data is correctly loaded into Palo, the data finally stored in Palo is as follows:
user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
We can see that the stored data is exactly the same as the loaded data, with no aggregation at all. This is because the newly added `timestamp` column makes the Keys of the rows in this batch all different. In other words, as long as the Keys of the rows in the loaded data are not exactly identical, Palo keeps the complete detail data even under the aggregate model.
Example 3: aggregation of loaded data and existing data
Following example 1, suppose that the data in the table are as follows:
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
Let's continue to load a new batch of data:
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
After this batch of data is correctly loaded into Palo, the data finally stored in Palo is as follows:
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
---|---|---|---|---|---|---|---|---|
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
We can see that the existing data for user 10004 is aggregated with the newly loaded data, and the data for the new user 10005 is added.
Data is aggregated in Palo at three stages:
- The ETL stage of each batch load, which aggregates the data within that batch.
- The compaction stage of the underlying BE, in which the BE further aggregates data from different load batches.
- The query stage, in which the data involved in the query is aggregated.
The degree of aggregation may differ at different points in time. For example, a freshly loaded batch may not yet have been aggregated with the existing data. But users can only ever query aggregated results, i.e. the varying degrees of physical aggregation are transparent to queries. Users should always assume that the data exists in its finally aggregated form, and should not rely on aggregation not having happened yet. (See the section on the limitations of the aggregate model for details.)
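For example, against the example-1 table, a query like the following sketch returns the fully aggregated value whether or not the underlying batches have already been compacted:

```sql
-- Always returns cost = 35 for user 10000 from example 1, regardless of how far
-- the background compaction of the two loaded rows has progressed.
SELECT user_id, SUM(cost) AS cost
FROM example_db.example_tbl
WHERE user_id = 10000
GROUP BY user_id;
```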
Unique model
In some multi-dimensional analysis scenarios, users care more about ensuring the uniqueness of the Key, i.e. obtaining a primary key uniqueness constraint. For this we introduce the Unique data model, which is essentially a special case of the aggregate model and a simplified way of describing a table structure. Here is an example:
ColumnName | Type | IsKey | Comment |
---|---|---|---|
user_id | BIGINT | Yes | User id |
username | VARCHAR(50) | Yes | User's nickname |
city | VARCHAR(20) | No | User's city |
age | SMALLINT | No | User's age |
sex | TINYINT | No | User's sex |
phone | LARGEINT | No | User's phone number |
address | VARCHAR(500) | No | User's address |
register_time | DATETIME | No | User registration time |
This is a typical table of basic user information. This kind of data has no aggregation requirement; only the uniqueness of the primary key needs to be guaranteed (the primary key here is user_id + username). The table creation statement is as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT " User id",
`username` VARCHAR(50) NOT NULL COMMENT "User's nickname",
`city` VARCHAR(20) COMMENT "User's city",
`age` SMALLINT COMMENT "User's age",
`sex` TINYINT COMMENT "User's sex",
`phone` LARGEINT COMMENT "User's phone number",
`address` VARCHAR(500) COMMENT "User's address",
`register_time` DATETIME COMMENT "User registration time "
)
UNIQUE KEY(`user_id`, `username`)
... /* Omit Partition and Distribution information */
;
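As a hedged illustration of the replacement semantics (the rows below are made up, and `INSERT INTO` is used here only as a convenient way to load two small batches):

```sql
-- Two separate load batches sharing the primary key (user_id, username).
INSERT INTO example_db.example_tbl VALUES
    (10001, 'alice', 'Beijing', 20, 0, 18612345678, 'Haidian District', '2017-10-01 08:00:00');

INSERT INTO example_db.example_tbl VALUES
    (10001, 'alice', 'Shanghai', 21, 0, 18612345678, 'Pudong District', '2017-10-02 09:00:00');

-- Only the row from the later batch remains visible for (10001, 'alice').
SELECT * FROM example_db.example_tbl WHERE user_id = 10001;
```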
This table structure is exactly the same as the following table structure described by the aggregation model:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user_id | BIGINT | | User id |
username | VARCHAR(50) | | User's nickname |
city | VARCHAR(20) | REPLACE | User's city |
age | SMALLINT | REPLACE | User's age |
sex | TINYINT | REPLACE | User's sex |
phone | LARGEINT | REPLACE | User's phone number |
address | VARCHAR(500) | REPLACE | User's address |
register_time | DATETIME | REPLACE | User registration time |
And the table creation statements are as follows:
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "User id",
`username` VARCHAR(50) NOT NULL COMMENT "User's nickname",
`city` VARCHAR(20) REPLACE COMMENT "User's city",
`age` SMALLINT REPLACE COMMENT "User's age",
`sex` TINYINT REPLACE COMMENT "User's sex",
`phone` LARGEINT REPLACE COMMENT "User's phone number",
`address` VARCHAR(500) REPLACE COMMENT "User's address",
`register_time` DATETIME REPLACE COMMENT "User registration time"
)
AGGREGATE KEY(`user_id`, `username`)
... /* Omit Partition and Distribution information */
;
That is, the Unique model can be completely replaced by REPLACE aggregation in the aggregate model; its internal implementation and data storage are exactly the same, so no further examples are given here.
Limitations of the aggregate model
The aggregate model (including the Unique model) reduces, through pre-computation, the volume of data that has to be computed at query time, and thus speeds up queries. However, the model has limitations in use.
In the aggregate model, what the model presents externally is the finally aggregated data. In other words, data that has not yet been physically aggregated (for example, data from two different load batches) must still be presented in some way that guarantees consistency. Here is an example.
Suppose the table structure is as follows:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user_id | LARGEINT | User id | |
date | DATE | Data injection date | |
cost | BIGINT | SUM | Total user consumption |
Suppose that the storage engine has the following two batches of data that have been loaded:
batch 1
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 50 |
10002 | 2017-11-21 | 39 |
batch 2
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 1 |
10001 | 2017-11-21 | 5 |
10003 | 2017-11-22 | 22 |
We can see that the data for user 10001 in the two load batches has not yet been aggregated. However, in order to ensure that users can only query the following finally aggregated data:
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 51 |
10001 | 2017-11-21 | 5 |
10002 | 2017-11-21 | 39 |
10003 | 2017-11-22 | 22 |
we add an aggregation operator to the query engine to guarantee externally consistent data.
In addition, when running an aggregate query on a Value column with an aggregate function that differs from the column's own aggregation type, attention must be paid to the semantics. For example, executing the following query on the example above:
SELECT MIN(cost) FROM table;
returns 5 rather than 1, because MIN is evaluated over the aggregated cost values (51, 5, 39, 22) instead of the originally loaded values.
At the same time, this consistency guarantee can greatly reduce efficiency for some queries.
Take the most basic count(*) query as an example:
SELECT COUNT(*) FROM table;
In other databases, this kind of query returns results quickly: the implementation can either maintain row counts at load time ("count rows and save the count statistics"), or scan just one column at query time, so the result is obtained with very little overhead. In Palo's aggregate model, however, the overhead of this kind of query is rather large.
Taking the data above as an example:
batch 1
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 50 |
10002 | 2017-11-21 | 39 |
batch 2
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 1 |
10001 | 2017-11-21 | 5 |
10003 | 2017-11-22 | 22 |
Because the final aggregation result is:
user_id | date | cost |
---|---|---|
10001 | 2017-11-20 | 51 |
10001 | 2017-11-21 | 5 |
10002 | 2017-11-21 | 39 |
10003 | 2017-11-22 | 22 |
The correct result of `select count(*) from table;` should be 4. But if we scan only the `user_id` column and apply aggregation during the query, the result is 3 (10001, 10002, 10003); without query-time aggregation, the result is 5 (the five rows across the two batches). Both results are wrong.
To get the correct result, we must read both the `user_id` and `date` columns and aggregate them at query time; only then is the correct value of 4 returned. In other words, for a count(*) query, Palo has to scan all AGGREGATE KEY columns (here `user_id` and `date`) and aggregate them before the semantically correct result is obtained. When there are many aggregate key columns, a count(*) query has to scan a large amount of data.
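Conceptually, what Palo has to evaluate for count(*) under the aggregate model resembles the following sketch (the table name `table` follows the placeholder used in the queries above):

```sql
-- Deduplicate on the full AGGREGATE KEY before counting; only then does the
-- result (4 in this example) match count(*) over the aggregated data.
SELECT COUNT(*)
FROM (
    SELECT user_id, `date`
    FROM `table`
    GROUP BY user_id, `date`
) AS deduplicated;
```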
Therefore, when the business issues frequent count(*) queries, we suggest simulating count(*) by adding a column whose value is always 1 and whose aggregation type is SUM. The table structure of the example above is modified as follows:
ColumnName | Type | AggregationType | Comment |
---|---|---|---|
user_id | BIGINT | | User id |
date | DATE | | Data injection date |
cost | BIGINT | SUM | Total user consumption |
count | BIGINT | SUM | Used to count |
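A hedged sketch of the corresponding table creation statement (the table name is illustrative, the column types follow the schema above, and the DEFAULT of "1" for `count` is merely a convenient assumption):

```sql
CREATE TABLE IF NOT EXISTS example_db.count_example_tbl
(
    `user_id` BIGINT NOT NULL COMMENT "User id",
    `date` DATE NOT NULL COMMENT "Data injection date",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user consumption",
    `count` BIGINT SUM DEFAULT "1" COMMENT "Used to count"
)
AGGREGATE KEY(`user_id`, `date`)
... /* Omit Partition and Distribution information */
;
```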
A `count` column is added, and its value is always 1 in the loaded data. The result of `select count(*) from table;` is then equivalent to that of `select sum(count) from table;`, and the latter is far more efficient than the former. However, this approach has a restriction: users must guarantee that they never load rows whose AGGREGATE KEY columns are all identical. Otherwise, `select sum(count) from table;` only expresses the number of originally loaded rows, not the semantics of `select count(*) from table;`.
Another way is to change the aggregation type of the `count` column above to REPLACE, while keeping its value always 1. Then the results of `select sum(count) from table;` and `select count(*) from table;` are consistent, and this approach imposes no restriction on loading duplicate rows.
Duplicate model
The Duplicate model has no such limitation of the aggregate model. Because the model involves no aggregation semantics, a count(*) query can obtain the semantically correct result by scanning any one column.
Suggestions on the selection of data models
Because the data model is fixed when the table is created and cannot be modified afterwards, choosing an appropriate data model is very important.
- The Aggregate model can greatly reduce the amount of data scanned and computed in queries through pre-aggregation, and is well suited to report-style queries with fixed patterns. However, the model is unfriendly to count(*) queries. Also, because the aggregation method of each Value column is fixed, semantic correctness must be considered when running other kinds of aggregate queries on those columns.
- The Unique model guarantees the primary key uniqueness constraint for scenarios that require it. However, it cannot exploit the query benefits of pre-aggregation such as Rollup (because its essence is REPLACE, so there is no SUM-like aggregation).
- The Duplicate model is suitable for ad-hoc queries over arbitrary dimensions. Although it also cannot take advantage of pre-aggregation, it is not constrained by the aggregate model and can fully exploit the advantages of columnar storage (reading only the relevant columns, not all Key columns).