Data Model

This document describes the data models of Palo at the logical level, to help users make better use of Palo in different business scenarios.

Basic concepts

In Palo, data is logically described in the form of tables. A table consists of rows and columns: a row is a single record of user data, and columns describe the individual fields of that record.

Columns can be divided into two categories: Key and Value. From a business perspective, Key and Value correspond to dimension columns and metric columns respectively.

The data models of Palo fall into three categories:

  • Duplicate: the detail model
  • Aggregate: the aggregation model
  • Unique: the primary-key model

Next we will introduce them separately.

Duplicate model

The detail model is the default data model in Palo. It does not process the loaded data in any way: the data stored in the table are exactly the raw data loaded by the user. Consider the following log table:

| ColumnName | Type | SortKey | Comment |
| --- | --- | --- | --- |
| timestamp | DATETIME | Yes | Log time |
| type | INT | Yes | Log type |
| error_code | INT | Yes | Error code |
| error_msg | VARCHAR(1024) | No | Error details |
| op_id | BIGINT | No | Id of the person in charge |
| op_time | DATETIME | No | Processing time |

The table creation statement is as follows:

CREATE TABLE IF NOT EXISTS example_db.example_tbl
(  
    `timestamp` DATETIME NOT NULL COMMENT "Log time",
    `type` INT NOT NULL COMMENT "Log type",
    `error_code` INT COMMENT "Error code",
    `error_msg` VARCHAR(1024) COMMENT "Error details",
    `op_id` BIGINT COMMENT "id of person in charge",
    `op_time` DATETIME COMMENT "Processing time"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
... /* Omit Partition and Distribution information */

The DUPLICATE KEY specified in the table creation statement only indicates which columns the underlying data are sorted by. (A more appropriate name would be "Sorted Column"; the name "DUPLICATE KEY" is used here only to make the data model explicit. For more on "Sorted Column", refer to the Index document.) When choosing the DUPLICATE KEY, we suggest selecting the first 2-4 columns as appropriate.
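For example, a query whose filter covers a prefix of the sorted columns can typically exploit the sort order to narrow the scan. A minimal sketch against the table above (the predicate values are made up):

-- The filter covers the sort-key prefix (`timestamp`, `type`),
-- so the sorted data can be used to skip irrelevant ranges.
SELECT `timestamp`, `type`, `error_code`
FROM example_db.example_tbl
WHERE `timestamp` >= "2017-10-01 00:00:00"
  AND `type` = 1;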

This data model is suitable for storing raw data that has neither aggregation requirements nor a primary-key uniqueness constraint. Moreover, users can build aggregation views on top of this model through the materialized view feature, so it is the recommended data model.
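For instance, an aggregation view that pre-computes error counts per log type might look roughly as follows. This is a hedged sketch: mv_error_stats is a made-up name, and the exact materialized view syntax should be checked against the current Palo release.

-- Pre-aggregate error counts by log type and error code.
CREATE MATERIALIZED VIEW mv_error_stats AS
SELECT `type`, `error_code`, COUNT(`error_msg`)
FROM example_db.example_tbl
GROUP BY `type`, `error_code`;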

Aggregate model

The aggregation model requires users to explicitly divide columns into Key columns and Value columns when creating a table. The model automatically aggregates rows with the same Key values on their Value columns.

Let's illustrate with practical examples what the aggregation model is and how to use it correctly.

Example 1: aggregating data at load time

Suppose we have the following table schema in the business:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | | User id |
| date | DATE | | Data injection date |
| city | VARCHAR(20) | | User's city |
| age | SMALLINT | | User's age |
| sex | TINYINT | | User's sex |
| last_visit_date | DATETIME | REPLACE | Last visit time of user |
| cost | BIGINT | SUM | Total consumption of user |
| max_dwell_time | INT | MAX | Maximum dwelling time of user |
| min_dwell_time | INT | MIN | Minimum dwelling time of user |

This translates into the following table creation statement (partition and distribution information is omitted):

CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User id",
    `date` DATE NOT NULL COMMENT "Data injection date",
    `city` VARCHAR(20) COMMENT "User's city",
    `age` SMALLINT COMMENT "User's age",
    `sex` TINYINT COMMENT "User's sex",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "Last visit time of user",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total consumption of user",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum dwelling time of user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum dwelling time of user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
... /* Omit Partition and Distribution information */

As we can see, this is a typical fact table of user information and visit behavior. In a typical star schema, user information and visit behavior would be stored in a dimension table and a fact table respectively. Here, to explain Palo's data model more conveniently, we store both kinds of information in a single table.

The columns in the table are divided into Key (dimension) columns and Value (metric) columns according to whether AggregationType is set. Columns without an AggregationType, such as user_id, date, and age, are Key columns, while columns with an AggregationType set are Value columns.

When data are loaded, rows with identical Key columns are aggregated into a single row, and the Value columns are aggregated according to their AggregationType. Four aggregation types are currently available:

  1. SUM: sum the Values of multiple rows.
  2. REPLACE: the Value in a later batch of data replaces the Value in previously loaded rows.
  3. MAX: keep the maximum value.
  4. MIN: keep the minimum value.

Suppose we load the following batch of (raw) data:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |

Suppose that this is a table that records the behaviors of users visiting a product page. Let's take the first row of data as an example to explain:

| Data | Description |
| --- | --- |
| 10000 | User id; each user has a unique id |
| 2017-10-01 | Data injection date, accurate to the day |
| Beijing | User's city |
| 20 | User's age |
| 0 | User's sex (0 for male, 1 for female) |
| 2017-10-01 06:00:00 | The time the user visited this page, accurate to the second |
| 20 | Consumption generated by this visit |
| 10 | Dwell time of this visit to the page |
| 10 | Dwell time of this visit to the page (redundant copy for MIN) |

After this batch of data is correctly loaded into Palo, it is stored in Palo as follows:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |

We can see that only one aggregated row remains for user 10000, while the data of other users are consistent with the original data. Let's first explain the aggregated data of user 10000:

The first five columns are unchanged; the changes start from the sixth column, last_visit_date:

  • 2017-10-01 07:00:00: because the aggregation type of the last_visit_date column is REPLACE, 2017-10-01 06:00:00 has been replaced by 2017-10-01 07:00:00 and the latter is saved.

    Note: for data within the same load batch, the replacement order is not guaranteed for the REPLACE aggregation type. In this example, 2017-10-01 06:00:00 might equally well have been the value finally saved. Across different load batches, however, it is guaranteed that data in a later batch replace data from earlier batches.

  • 35: because the aggregation type of the cost column is SUM, 35 is the result of 20 + 15.
  • 10: because the aggregation type of the max_dwell_time column is MAX, 10 is the maximum of 10 and 2.
  • 2: because the aggregation type of the min_dwell_time column is MIN, 2 is the minimum of 10 and 2.

After aggregation, only the aggregated data is stored in Palo. In other words, the detail data are lost, and users can no longer query the detail data that existed before aggregation.
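The aggregation above can be reproduced directly; a minimal sketch, assuming INSERT INTO is used as the load method:

-- Two raw rows for user 10000 that share identical Key columns.
INSERT INTO example_db.example_tbl VALUES
(10000, "2017-10-01", "Beijing", 20, 0, "2017-10-01 06:00:00", 20, 10, 10),
(10000, "2017-10-01", "Beijing", 20, 0, "2017-10-01 07:00:00", 15, 2, 2);

-- Only the aggregated row is visible:
-- cost = 35, max_dwell_time = 10, min_dwell_time = 2.
SELECT * FROM example_db.example_tbl WHERE `user_id` = 10000;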

Example 2: retaining detail data

Following example 1, the table structure is modified as follows:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | | User id |
| date | DATE | | Data injection date |
| timestamp | DATETIME | | Data injection time, accurate to the second |
| city | VARCHAR(20) | | User's city |
| age | SMALLINT | | User's age |
| sex | TINYINT | | User's sex |
| last_visit_date | DATETIME | REPLACE | Last visit time of user |
| cost | BIGINT | SUM | Total consumption of user |
| max_dwell_time | INT | MAX | Maximum dwelling time of user |
| min_dwell_time | INT | MIN | Minimum dwelling time of user |

A timestamp column has been added to record the data injection time, accurate to the second. Since it has no AggregationType set, it is also a Key column.

Loaded data are as follows:

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |

After this batch of data is correctly loaded into Palo, it is stored in Palo as follows:

| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |

We can see that the stored data are exactly the same as the loaded data, without any aggregation. This is because the newly added timestamp column makes the Keys of the rows differ. In other words, as long as the Keys of the loaded rows are not exactly identical, Palo preserves the complete detail data even under the aggregation model.
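Note that even though the table now stores full detail, the aggregated view of example 1 can still be recovered at query time. A sketch against the schema above:

-- Roll the detail rows back up to the per-user, per-day level.
SELECT `user_id`, `date`,
       SUM(`cost`) AS total_cost,
       MAX(`max_dwell_time`) AS max_dwell_time,
       MIN(`min_dwell_time`) AS min_dwell_time
FROM example_db.example_tbl
GROUP BY `user_id`, `date`;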

Example 3: aggregating loaded data with existing data

Following example 1, suppose that the data in the table are as follows:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |

Let's continue to load a new batch of data:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |

After this batch of data is correctly loaded into Palo, it is stored in Palo as follows:

| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |

We can see that the existing data of user 10004 are aggregated with the newly loaded data, and the data of the new user 10005 are added.
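This incremental aggregation can be exercised the same way; a minimal sketch, again assuming INSERT INTO as the load method:

-- Load the new batch; user 10004's row merges with the stored 2017-10-03 row.
INSERT INTO example_db.example_tbl VALUES
(10004, "2017-10-03", "Shenzhen", 35, 0, "2017-10-03 11:22:00", 44, 19, 19),
(10005, "2017-10-03", "Changsha", 29, 1, "2017-10-03 18:11:02", 3, 1, 1);

-- Returns a single row: cost = 55 (11 + 44), max_dwell_time = 19, min_dwell_time = 6.
SELECT * FROM example_db.example_tbl
WHERE `user_id` = 10004 AND `date` = "2017-10-03";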

There are three stages at which data are aggregated in Palo:

  1. The ETL stage of each data load. Data within a single load batch are aggregated at this stage.
  2. The stage at which the underlying BE performs data Compaction. At this stage, the BE further aggregates loaded data from different batches.
  3. The query stage. Data involved in a query are aggregated again before the result is returned.

The degree of aggregation may therefore differ over time. For example, a freshly loaded batch may not yet have been aggregated with the existing data. But users can only ever query aggregated data; the differing degrees of internal aggregation are transparent to queries. Users should always assume the data exist in their finally aggregated form, and should never rely on some aggregation not having happened yet. (See the section on limitations of the aggregation model for details.)

Unique model

In some multi-dimensional analysis scenarios, users care less about aggregation and more about guaranteeing the uniqueness of the Key, i.e. a primary-key uniqueness constraint. For this we introduce the Unique data model, which is, in essence, a special case of the aggregation model and a simplified way of describing the table structure. For example:

| ColumnName | Type | IsKey | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | Yes | User id |
| username | VARCHAR(50) | Yes | User's nickname |
| city | VARCHAR(20) | No | User's city |
| age | SMALLINT | No | User's age |
| sex | TINYINT | No | User's sex |
| phone | LARGEINT | No | User's phone number |
| address | VARCHAR(500) | No | User's address |
| register_time | DATETIME | No | User registration time |

This is a typical table of basic user information. Such data has no aggregation requirement; only the uniqueness of the primary key must be guaranteed (the primary key here is user_id + username). The table creation statement is as follows:

CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User id",
    `username` VARCHAR(50) NOT NULL COMMENT "User's nickname",
    `city` VARCHAR(20) COMMENT "User's city",
    `age` SMALLINT COMMENT "User's age",
    `sex` TINYINT COMMENT "User's sex",
    `phone` LARGEINT COMMENT "User's phone number",
    `address` VARCHAR(500) COMMENT "User's address",
    `register_time` DATETIME COMMENT "User registration time"
)
UNIQUE KEY(`user_id`, `username`)
... /* Omit Partition and Distribution information */

This table structure is exactly equivalent to the following table structure described using the aggregation model:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | | User id |
| username | VARCHAR(50) | | User's nickname |
| city | VARCHAR(20) | REPLACE | User's city |
| age | SMALLINT | REPLACE | User's age |
| sex | TINYINT | REPLACE | User's sex |
| phone | LARGEINT | REPLACE | User's phone number |
| address | VARCHAR(500) | REPLACE | User's address |
| register_time | DATETIME | REPLACE | User registration time |

And the table creation statement is as follows:

CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "User id",
    `username` VARCHAR(50) NOT NULL COMMENT "User's nickname",
    `city` VARCHAR(20) REPLACE COMMENT "User's city",
    `age` SMALLINT REPLACE COMMENT "User's age",
    `sex` TINYINT REPLACE COMMENT "User's sex",
    `phone` LARGEINT REPLACE COMMENT "User's phone number",
    `address` VARCHAR(500) REPLACE COMMENT "User's address",
    `register_time` DATETIME REPLACE COMMENT "User registration time"
)
AGGREGATE KEY(`user_id`, `username`)
... /* Omit Partition and Distribution information */

That is, the Unique model can be completely expressed with the REPLACE aggregation type of the aggregation model; the internal implementation and data storage are exactly the same. We will not give further examples here.
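The REPLACE semantics give the Unique model upsert-like behavior; a minimal sketch (all values are made up):

-- Two loads that carry the same primary key (user_id, username).
INSERT INTO example_db.example_tbl VALUES
(10000, "jack", "Beijing", 20, 0, 13812345678, "Haidian", "2017-10-01 06:00:00");

INSERT INTO example_db.example_tbl VALUES
(10000, "jack", "Shanghai", 21, 0, 13812345678, "Pudong", "2018-02-01 08:00:00");

-- Only the later row survives: city = Shanghai, age = 21, address = Pudong.
SELECT * FROM example_db.example_tbl WHERE `user_id` = 10000;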

Limitations of aggregation model

The aggregation model (including the Unique model) uses precomputation to reduce the amount of data that must be computed at query time, which greatly speeds up queries. However, the model has limitations in use.

In the aggregation model, what the model exposes externally is the finally aggregated data. That means any data that have not yet been physically aggregated (for example, data from two different load batches) must still be presented in a consistent way. Here is an example.

Suppose the table structure is as follows:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | | User id |
| date | DATE | | Data injection date |
| cost | BIGINT | SUM | Total user consumption |

Suppose that the storage engine has the following two batches of data that have been loaded:

batch 1

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |

batch 2

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |

We can see that the data of user 10001 from the two load batches have not yet been aggregated. However, to guarantee that users can only query the following finally aggregated data:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |

we add an aggregation operator to the query engine, which ensures the external consistency of the data.

In addition, when running an aggregate query on a Value column whose aggregate function differs from the column's own aggregation type, the semantics deserve attention. For example, consider the following query against the example above:

SELECT MIN(cost) FROM table;

The result is 5, not 1, because the query can only ever see the already-aggregated cost values (51, 5, 39, 22).

At the same time, this consistency guarantee can greatly reduce efficiency for some queries.

Let's take the most basic count(*) query as an example:

SELECT COUNT(*) FROM table;

In other databases, this kind of query returns results quickly. That is because, in the implementation, the count can be obtained with little overhead, either by "counting rows and storing the statistic" at load time, or by "scanning just one column" at query time. In Palo's aggregation model, however, the overhead of this kind of query is large.

Let's take the above data as an example:

batch 1

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 50 |
| 10002 | 2017-11-21 | 39 |

batch 2

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 1 |
| 10001 | 2017-11-21 | 5 |
| 10003 | 2017-11-22 | 22 |

Because the final aggregation result is:

| user_id | date | cost |
| --- | --- | --- |
| 10001 | 2017-11-20 | 51 |
| 10001 | 2017-11-21 | 5 |
| 10002 | 2017-11-21 | 39 |
| 10003 | 2017-11-22 | 22 |

The correct result of select count(*) from table; should be 4. But if we scan only the user_id column and apply query-time aggregation on that column alone, the result is 3 (users 10001, 10002, 10003); without query-time aggregation, the result is 5 (five rows across the two batches). Both results are wrong.

To get the correct result, we must read both the user_id and date columns and aggregate at query time; only then is the correct answer of 4 returned. In other words, a count(*) query must scan all AGGREGATE KEY columns (here, user_id and date) and aggregate them to obtain the semantically correct result. When there are many aggregate-key columns, a count(*) query has to scan a large amount of data.

Therefore, when the business issues frequent count(*) queries, we suggest simulating count(*) by adding a column whose value is always 1 and whose aggregation type is SUM. The table structure from the example above is modified as follows:

| ColumnName | Type | AggregationType | Comment |
| --- | --- | --- | --- |
| user_id | LARGEINT | | User id |
| date | DATE | | Data injection date |
| cost | BIGINT | SUM | Total user consumption |
| count | BIGINT | SUM | Used to count |

Add a count column whose loaded value is always 1. Then select sum(count) from table; returns the same result as select count(*) from table;, and the former is far more efficient. This approach has one limitation, though: users must guarantee that they never load multiple rows with the same AGGREGATE KEY columns. Otherwise, select sum(count) from table; reports only the number of originally loaded rows, not the semantics of select count(*) from table;.
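If the table already exists, the helper column can be added through a schema change. A hedged sketch (the table name example_tbl is hypothetical here, and it assumes ALTER TABLE ADD COLUMN accepts an aggregation type and default value for Value columns, as it does in Doris):

-- Add a Value column that every load sets to 1.
ALTER TABLE example_db.example_tbl
ADD COLUMN `count` BIGINT SUM DEFAULT "1" COMMENT "Used to count";

-- Far cheaper than SELECT COUNT(*): only the `count` column is scanned.
SELECT SUM(`count`) FROM example_db.example_tbl;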

Another way is to change the aggregation type of the count column above to REPLACE, with the value still always 1. Then the results of select sum(count) from table; and select count(*) from table; will be consistent, and this variant imposes no restriction on loading duplicate-key rows.

Duplicate model

The Duplicate model does not have this limitation of the aggregation model. Because the model involves no aggregation semantics, a count(*) query can obtain the semantically correct result by scanning any one column.

Suggestions on the selection of data models

Because the data model is fixed at table creation and cannot be changed afterwards, choosing an appropriate data model is very important.

  1. The Aggregate model can greatly reduce the amount of data scanned and computed in a query through pre-aggregation, so it is well suited to report-style queries with fixed patterns. However, the model is unfriendly to count(*) queries. Also, because the aggregation type of each Value column is fixed, the semantics of other kinds of aggregate queries on those columns need to be considered.
  2. The Unique model guarantees the primary-key uniqueness constraint for scenarios that require it. However, it cannot exploit the query benefits of pre-aggregation such as Rollup (because its essence is REPLACE, it has no aggregation such as SUM).
  3. The Duplicate model is suitable for ad-hoc queries over any dimensions. Although it also cannot exploit pre-aggregation, it is not constrained by the aggregation model and can take full advantage of columnar storage (reading only the relevant columns, without having to read all Key columns).