Materialized View
Materialized view
Materialized view is a special table that stores the data set that has been calculated in advance (according to the defined SELECT statement) in Palo.
Materialized view is mainly to meet the needs of users, it can not only analyze any dimension of the original detail data, but also analyze and query the fixed dimension quickly.
Applicable scenarios
- Analysis requirements detailed data query and fixed dimension query.
- Queries involve only a small number of columns or rows in a table.
- Query contains some time-consuming processing operations, such as long time aggregation.
- Queries need to match different prefix indexes.
Advantages
- The query performance for those using the same subquery results repeatedly is greatly improved.
- Palo maintains the data of materialized view automatically and ensures the data consistency of base table and materialized view whether they are new loading or deletion operation. No additional labor maintenance costs are required.
- It will automatically match to the optimal materialized view and read the data directly from it during querying.
Data of automatic maintenance of materialized view will cause some maintenance overhead, which will be explained in the subsequent limitations of materialized view.
Materialized view VS Rollup
Before materialized view function, users generally use Rollup function to improve query efficiency through pre-aggregation. But Rollup has some limitations and it can't do pre-aggregation based on detail model.
Covering Rollup function, materialized view also supports more abundant aggregate functions, which means that materialized view is actually a superset of Rollup.
In other words, the functions supported by the previous ALTER TABLE ADD ROLLUP syntax can now be implemented through CREATE MATERIALIZED VIEW.
Using materialized views
Palo system provides a complete set of DDL syntax for materialized views, including creating, viewing and deletion. The DDL syntax is consistent with PostgreSQL and Oracle.
Create materialized views
First, you need to decide what kind of materialized view to create according to the characteristics of your query statement. That is not to say, however, that it is best that your materialized view definition is identical to one of your queries. And we have two principles:
- Abstract from the query statement, the grouping and aggregation methods shared by multiple queries are defined as materialized views.
- There is no need to create materialized views for all dimension combinations.
First of all, if a materialized view is abstracted, multiple queries can match it. This is the best materialized view because the maintenance of materialized view itself also needs resource consumption.
If the materialized view only fits a special query, no other query can use it. And it will lead to the low cost performance of this materialized view, which not only takes up the storage resources of the cluster, but also cannot serve more queries.
So the user needs to abstract some materialized view definitions by combining own query statements and data dimension information.
The second is that in the actual analysis query, it will not cover all dimension analysis. Therefore, to create materialized view for common dimension combination can achieve a balance of space and time.
Create materialized views with the following command. Creating materialized view is an asynchronous operation, that is, after the user successfully submits the creation task, Palo calculates the stock data in the background until the view is successfully created.
Refer to CREATE MATERIALIZED VIEW for specific syntax.
Supported aggregate functions
At present, aggregation functions supported by materialized view creation statements are as follows:
- SUM, MIN, MAX
- COUNT, BITMAP_UNION, HLL_UNION
- The form of BITMAP_UNION must be:
BITMAP_UNION(TO_BITMAP(COLUMN))
, column can only be of type integer orBITMAP_UNION(COLUMN)
(largeint is not supported, either) , and the base table is of AGG model. - The form of HLL_UNION must be:
HLL_UNION(HLL_HASH(COLUMN))
, column cannot be of type DECIMAL orHLL_UNION(COLUMN)
, and the base table is of AGG model.
Update strategy
Palo will synchronize the operations of loading and deleting base tables into materialized view table to ensure the consistency of data between materialized view and base table. Also, Palo will improve the update efficiency through incremental update and guarantee the atomicity through transactions.
If the user, for example, inserts data into the base table through INSERT command, the data will be synchronized into the materialized view. Only when the base table and materialized view table are successfully written can INSERT command return successfully.
Query of automatic matching
After the materialized view is created successfully, the user's query does not need any change, that is, the base table of the query. Palo will automatically select an optimal materialized view according to the current query statement, and read data from the materialized view for calculation.
The user can check whether materialized views are used in the current query through EXPLAIN command.
Matching relationship between aggregation in materialized view and aggregation in query:
Aggregation in materialized view | Aggregation in query |
---|---|
sum | sum |
min | min |
max | max |
count | count |
bitmap_union | bitmap_union, bitmap_union_count, count(distinct) |
hll_union | hll_raw_agg, hll_union_agg, ndv, approx_count_distinct |
After the aggregate function of bitmap and hll matches the materialized view, the aggregation operator of the query will be rewritten according to the table structure of the materialized view. Refer to Example 2 for details.
Query of materialized view
Check the kinds of materialized views of the current table and their table structure through the following commands:
MySQL [test]> desc mv_test all;
+-----------+---------------+-----------------+----------+------+-------+---------+--------------+
| IndexName | IndexKeysType | Field | Type | Null | Key | Default | Extra |
+-----------+---------------+-----------------+----------+------+-------+---------+--------------+
| mv_test | DUP_KEYS | k1 | INT | Yes | true | NULL | |
| | | k2 | BIGINT | Yes | true | NULL | |
| | | k3 | LARGEINT | Yes | true | NULL | |
| | | k4 | SMALLINT | Yes | false | NULL | NONE |
| | | | | | | | |
| mv_2 | AGG_KEYS | k2 | BIGINT | Yes | true | NULL | |
| | | k4 | SMALLINT | Yes | false | NULL | MIN |
| | | k1 | INT | Yes | false | NULL | MAX |
| | | | | | | | |
| mv_3 | AGG_KEYS | k1 | INT | Yes | true | NULL | |
| | | to_bitmap(`k2`) | BITMAP | No | false | | BITMAP_UNION |
| | | | | | | | |
| mv_1 | AGG_KEYS | k4 | SMALLINT | Yes | true | NULL | |
| | | k1 | BIGINT | Yes | false | NULL | SUM |
| | | k3 | LARGEINT | Yes | false | NULL | SUM |
| | | k2 | BIGINT | Yes | false | NULL | MIN |
+-----------+---------------+-----------------+----------+------+-------+---------+--------------+
As we can see, the current mvtest
table has three materialized views: mv 1, mv_ 2 and mv_ 3, and their table structures can also be seen.
Deletion of materialized view
You can delete materialized view through the following command if you no longer need it: CROP MATERIALIZED VIEW.
Best practice 1
Materialized view usage is generally divided into the following steps:
- Create a materialized view
- Asynchronously check whether the materialized view is completed
- Query and automatically match materialized views
First stem: create a materialized view
Suppose the user has a sales record list, storing the transaction id, salesperson, selling store, selling time and amount of each transaction. Create the table with the following statements:
create table sales_records(record_id int, seller_id int, store_id int, sale_date date, sale_amt bigint) distributed by hash(record_id) properties("replication_num" = "1");
The structure of the table sales_records
is as follows:
MySQL [test]> desc sales_records;
+-----------+--------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------+------+-------+---------+-------+
| record_id | INT | Yes | true | NULL | |
| seller_id | INT | Yes | true | NULL | |
| store_id | INT | Yes | true | NULL | |
| sale_date | DATE | Yes | false | NULL | NONE |
| sale_amt | BIGINT | Yes | false | NULL | NONE |
+-----------+--------+------+-------+---------+-------+
At this time, if the user often makes an analysis and query on the sales volume of different stores, then create a materialized view of the sum of sales of the same selling stores grouped by selling stores for the table sales_records
with the following statements:
create materialized view store_amt as select store_id, sum(sale_amt) from sales_records group by store_id;
If it returns successfully, the task of creating materialized view is submitted successfully.
Step 2: check whether the materialized view is completed
Because creating materialized view is an asynchronous operation, after submitting the task of creating materialized view, the user needs to check whether the materialized view has been completed asynchronously through the command, which is as follows:
SHOW ALTER TABLE MATERIALIZED VIEW FROM db_name;
Refer to SHOW ALTER TABLE MATERIALIZED VIEW for more help of this command.
Step 3: query
When the materialized view is created and when the user queries the sales volume of different stores again, the aggregated data in created materialized view store_amt
will be read directly to improve the query efficiency.
The table sales_records
is still specified by the user for query, for example:
SELECT store_id, sum(sale_amt) FROM sales_records GROUP BY store_id;
The above query will automatically match to store_amt
. The user can check whether the current query matches the appropriate materialized view through the following commands.
MySQL [test]> EXPLAIN SELECT store_id, sum(sale_amt) FROM sales_records GROUP BY store_id;
+-----------------------------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS:<slot 2> `store_id` | <slot 3> sum(`sale_amt`) |
| PARTITION: UNPARTITIONED |
| |
| RESULT SINK |
| |
| 4:EXCHANGE |
| |
| PLAN FRAGMENT 1 |
| OUTPUT EXPRS: |
| PARTITION: HASH_PARTITIONED: <slot 2> `store_id` |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 04 |
| UNPARTITIONED |
| |
| 3:AGGREGATE (merge finalize) |
| | output: sum(<slot 3> sum(`sale_amt`)) |
| | group by: <slot 2> `store_id` |
| | |
| 2:EXCHANGE |
| |
| PLAN FRAGMENT 2 |
| OUTPUT EXPRS: |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 02 |
| HASH_PARTITIONED: <slot 2> `store_id` |
| |
| 1:AGGREGATE (update serialize) |
| | STREAMING |
| | output: sum(`sale_amt`) |
| | group by: `store_id` |
| | |
| 0:OlapScanNode |
| TABLE: sales_records |
| PREAGGREGATION: ON |
| partitions=1/1 |
| rollup: store_amt |
| tabletRatio=10/10 |
| tabletList=22038,22040,22042,22044,22046,22048,22050,22052,22054,22056 |
| cardinality=0 |
| avgRowSize=0.0 |
| numNodes=1 |
+-----------------------------------------------------------------------------+
45 rows in set (0.006 sec)
Among them, the most important one is Rollup property in OlapScanNode. We can see that the Rollup of the current query displaysstore_amt
, which means the query has been correctly matched to the materialized view store_amt
, and can read data directly from materialized view.
Best practice 2 PV,UV
Business scenario: calculate UV and PV of advertisement
Suppose that the user's original ad click data are stored in Palo, for ad PV, UV query can be sped up by creating a materialized viewbitmap_union
.
First, create a table to store the click data details of ads, including the click events of each click, what ads are clicked, through what channels, and who clicked the ads.
MySQL [test]> create table advertiser_view_record(time date, advertiser varchar(10), channel varchar(10), user_id int) distributed by hash(time) properties("replication_num" = "1");
Query O
K, 0 rows affected (0.014 sec)
The structure of original ad click data table is as follows:
MySQL [test]> desc advertiser_view_record;
+------------+-------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-------+---------+-------+
| time | DATE | Yes | true | NULL | |
| advertiser | VARCHAR(10) | Yes | true | NULL | |
| channel | VARCHAR(10) | Yes | false | NULL | NONE |
| user_id | INT | Yes | false | NULL | NONE |
+------------+-------------+------+-------+---------+-------+
4 rows in set (0.001 sec)
-
Create a materialized view
Since the user wants to query the UV value of the advertisement, which means the user of the same advertisement needs to be accurately de-duplicated, then the query is generally as follows:
SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel;
For this kind of UV scenario, we can create a materialized view with
bitmap_union
to achieve a pre-accurate de-duplication.In Palo, the aggregation results of
count (distinct)
andbitmapunion_count
are fully identical. And
bitmap_union_countequals
bitmap_unionresult of count, so if **the query involves
count(distinct)`, you can use > to create a materialized view with bitmap union aggregation to speed up query.**For this case, you can create a materialized view based on advertisement and channel group for precise de-duplication of
user_id
.MySQL [test]> create materialized view advertiser_uv as select advertiser, channel, bitmap_union(to_bitmap(user_id)) from advertiser_view_record group by advertiser, channel; Query OK, 0 rows affected (0.012 sec)
Note: Because user_id itself is of INT type, you need to convert the filed to bitmap type through function
to_bitmap
beforebitmap_union
aggregation in Palo.After the creation, the table structures of ad click list and materialized view are as follows:
MySQL [test]> desc advertiser_view_record all; +------------------------+---------------+----------------------+-------------+------+-------+---------+--------------+ | IndexName | IndexKeysType | Field | Type | Null | Key | Default | Extra | +------------------------+---------------+----------------------+-------------+------+-------+---------+--------------+ | advertiser_view_record | DUP_KEYS | time | DATE | Yes | true | NULL | | | | | advertiser | VARCHAR(10) | Yes | true | NULL | | | | | channel | VARCHAR(10) | Yes | false | NULL | NONE | | | | user_id | INT | Yes | false | NULL | NONE | | | | | | | | | | | advertiser_uv | AGG_KEYS | advertiser | VARCHAR(10) | Yes | true | NULL | | | | | channel | VARCHAR(10) | Yes | true | NULL | | | | | to_bitmap(`user_id`) | BITMAP | No | false | | BITMAP_UNION | +------------------------+---------------+----------------------+-------------+------+-------+---------+--------------+
-
Query of automatic matching
When the materialized view view is created and the advertisement UV is queried, Palo will automatically query data from created materialized view
advertiser_uv
. The original query statements, for example, are as follows:SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel;
After the materialized view is selected, the actual query will be converted to:
SELECT advertiser, channel, bitmap_union_count(to_bitmap(user_id)) FROM advertiser_uv GROUP BY advertiser, channel;
Check whether Palo matched the materialized view through EXPLAIN command:
MySQL [test]> explain SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel; +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Explain String | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | PLAN FRAGMENT 0 | | OUTPUT EXPRS:<slot 7> `advertiser` | <slot 8> `channel` | <slot 9> bitmap_union_count(`default_cluster:test`.`advertiser_view_record`.`mv_bitmap_union_user_id`) | | PARTITION: UNPARTITIONED | | | | RESULT SINK | | | | 4:EXCHANGE | | | | PLAN FRAGMENT 1 | | OUTPUT EXPRS: | | PARTITION: HASH_PARTITIONED: <slot 4> `advertiser`, <slot 5> `channel` | | | | STREAM DATA SINK | | EXCHANGE ID: 04 | | UNPARTITIONED | | | | 3:AGGREGATE (merge finalize) | | | output: bitmap_union_count(<slot 6> bitmap_union_count(`default_cluster:test`.`advertiser_view_record`.`mv_bitmap_union_user_id`)) | | | group by: <slot 4> `advertiser`, <slot 5> `channel` | | | | | 2:EXCHANGE | | | | PLAN FRAGMENT 2 | | OUTPUT EXPRS: | | PARTITION: RANDOM | | | | STREAM DATA SINK | | EXCHANGE ID: 02 | | HASH_PARTITIONED: <slot 4> `advertiser`, <slot 5> `channel` | | | | | | 1:AGGREGATE (update serialize) | | | STREAMING | | | output: bitmap_union_count(`default_cluster:test`.`advertiser_view_record`.`mv_bitmap_union_user_id`) | | | group by: `advertiser`, `channel` | | | | | 0:OlapScanNode | | TABLE: advertiser_view_record | | PREAGGREGATION: ON | | partitions=1/1 | | rollup: advertiser_uv | | tabletRatio=10/10 | | tabletList=22084,22086,22088,22090,22092,22094,22096,22098,22100,22102 | | cardinality=0 | | avgRowSize=0.0 | | numNodes=1 | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 45 rows in set (0.030 sec)
In EXPLAIN results, we can first see that the value of Rollup property of OlapScanNode is advertiser_ uv which means the query directly scans the materialized view data, indicating that the match is successful.
Second, for the field
userid`, count (distinct) is rewritten to bitmap union count(to bitmap)`to achieve accurate de-duplication through bitmap.
Best practice 3
Business scenario: Matching richer prefix index
The user's original table has three columns (k1, k2, k3), where k1 and k2 are prefix index columns. In this case, if the user's query criteria contain where k1=1 and k2=2
, the query can be accelerated through the index.
However, in some cases, the user's filtering conditions cannot match the prefix index, where k3=3
for example. Then the query speed cannot be improved by index.
Create a materialized view with k3 as the first column to solve this problem.
-
Create a materialized view
CREATE MATERIALIZED VIEW mv_1 as SELECT k3, k2, k1 FROM tableA ORDER BY k3;
After creating the view through the above syntax, the materialized view retains the complete detail data, and its prefix index is column k3. Here is the table structure:
MySQL [test]> desc tableA all; +-----------+---------------+-------+------+------+-------+---------+-------+ | IndexName | IndexKeysType | Field | Type | Null | Key | Default | Extra | +-----------+---------------+-------+------+------+-------+---------+-------+ | tableA | DUP_KEYS | k1 | INT | Yes | true | NULL | | | | | k2 | INT | Yes | true | NULL | | | | | k3 | INT | Yes | true | NULL | | | | | | | | | | | | mv_1 | DUP_KEYS | k3 | INT | Yes | true | NULL | | | | | k2 | INT | Yes | false | NULL | NONE | | | | k1 | INT | Yes | false | NULL | NONE | +-----------+---------------+-------+------+------+-------+---------+-------+
-
Query matching
At this time, if the filter condition is that there is column K3 in the user's query, for example:
select k1, k2, k3 from table A where k3=3;
then the query will read data directly from the created materialized view mv_1 . Materialized view has prefix index to k3, so the query efficiency will be improved.
Limitations
- The parameter of aggregate function of materialized view supports only single column rather than expressions, sum (a+b), for example, is not supported.
- If the condition column of the deletion statement is not in the materialized view, the deletion operation cannot be performed. Delete materialized view first before deleting data if you really need to delete data.
- Too many materialized views on a single table will affect the loading efficiency: materialized views and base table data are updated synchronously when loading data, if there are more than 10 materialized views in a table, the loading speed may be very slow, which is similar to loading data of 10 tables simultaneously for a single load.
- The same column and different aggregation functions cannot appear in one materialized view simultaneously, select sum(a), min(a) from table are not supported, for example.
Exception error
-
DATA_QUALITY_ERR: "The data quality does not satisfy, please check your data"
This is caused by failed materialized view creation due to data quality problems. Note: bitmap type only supports positive integers, if there are negative integers in original data, the creation of materialized views will fail.