CREATE-TABLE

Last Updated：2021-12-14

CREATE TABLE

Description

This command is used to create a table. This article mainly introduces the syntax of creating tables self-maintained by Palo. For syntax of external tables, refer to [CREATE EXTERNAL TABLE](PALO/Operating Manual/Data Load /Synchronize Data Through External Table.md#) Document.

CREATE TABLE [IF NOT EXISTS] [database.]table
(
    column_definition_list,
    [index_definition_list]
)
[engine_type]
[keys_type]
[table_comment]
[partition_info]
distribution_info
[rollup_list]
[properties]
[extra_properties]

column_definition_list

Column definition list:

column_definition[, column_definition]

column_definition

Column definition:

column_name column_type [KEY] [aggr_type] [NULL] [default_value] [column_comment]

column_type

Column type, the following types are supported:

TINYINT（1 byte）
   Range: -2^7 + 1 ~ 2^7 - 1
SMALLINT（2 bytes）
    Range: -2^15 + 1 ~ 2^15 - 1
INT（4 bytes）
    Range: -2^31 + 1 ~ 2^31 - 1
BIGINT（8 bytes）
   Range: -2^63 + 1 ~ 2^63 - 1
LARGEINT（16 bytes）
    Range: -2^127 + 1 ~ 2^127 - 1
FLOAT（4 bytes）
    support scientific counting method
DOUBLE（12 bytes）
    support scientific counting method
DECIMAL[(precision, scale)] (16 bytes)
    guarantee the decimal type of precision, default to DECIMAL(10, 0)
    precision: 1 ~ 27
    scale: 0 ~ 9
    In which, the integer part is 1 ~ 18
    not support the scientific counting method
DATE（3 bytes）
    Range: 0000-01-01 ~ 9999-12-31
DATETIME（8 bytes）
    Range:0000-01-01 00:00:00 ~ 9999-12-31 23:59:59
CHAR[(length)]
    fixed-length string. Range of length: 1 ~ 255. Default to 1
VARCHAR[(length)]
    Variable-length string. Range of length: 1 ~ 65533. Default to 1
HLL (1~16385 bytes)
    HyperLogLog column type,there is no need to specify the length and default value.The length is controlled in the system according to the aggregation degree of data.
   Must be used in combination with HLL_UNION aggregation type.
BITMAP
    bitmap column type,there is no need to specify the length and default value.It represents the sets of integers, elements support up to 2^64 - 1.
    Must be used in combination with BITMAP_UNION aggregation type.

aggr_type

Aggregation type, the following types are supported:

SUM：sum.Applicable to numeric type.
MIN：minimum value. Applicable to numeric type.
MAX：maximum value. Applicable to numeric type.
REPLACE：replace. For rows with the same dimension column, the index columns will be imported in the order of import, and replace the one imported previously with the one imported later.
REPLACE_IF_NOT_NULL：replace if not null. The difference between it with REPLACE lies in only replacing values not null.
HLL_UNION：aggregation method for columns of HLL type, aggregate by HyperLogLog algorithm.
BITMAP_UNION：aggregation method for columns of BIMTAP type, aggregate the union set of bitmaps.

Example:

  ```
  k1 TINYINT,
  k2 DECIMAL(10,2) DEFAULT "10.5",
  k4 BIGINT NULL DEFAULT VALUE "1000" COMMENT "This is column k4",
  v1 VARCHAR(10) REPLACE NOT NULL,
  v2 BITMAP BITMAP_UNION,
  v3 HLL HLL_UNION,
  v4 INT SUM NOT NULL DEFAULT "1" COMMENT "This is column v4"
  ```

index_definition_list

Index list definition:

index_definition[, index_definition]

index_definition

Index definition:

INDEX index_name (col_name[, col_name, ...]) [USING BITMAP] COMMENT 'xxxxxx'

Example:

INDEX idx1 (k1, k2) USING BITMAP COMMENT "This is a bitmap index"

engine_type

Table engine type. All types in this document are OLAP. For other external table engine types, refer to CREATE EXTERNAL TABLE document. Example:

ENGINE=olap
key_desc

Data model.

key_type(col1, col2, ...)

key_type supports the following models:
- DUPLICATE KEY（default）: columns specified after that are ordering columns.
- AGGREGATE KEY：the columns specified after that are dimensional columns.
- UNIQUE KEY：the columns specified after that are key columns.
Example:
```
DUPLICATE KEY(col1, col2),
AGGREGATE KEY(k1, k2, k3),
UNIQUE KEY(k1, k2)
```
table_comment

Table comment. Example:
```
COMMENT "This is my first PALO table"
```

partition_desc

Partition information, support two kinds of writing:

LESS THAN：only define the upper bound of partition. Lower bound is determined by the upper bound of last partition.

PARTITION BY RANGE(col1[, col2, ...])
(
    PARTITION partition_name1 VALUES LESS THAN MAXVALUE|("value1", "value2", ...),
    PARTITION partition_name2 VALUES LESS THAN MAXVALUE|("value1", "value2", ...)
)

FIXED RANGE: define the left-closed and right open interval of partition.

PARTITION BY RANGE(col1[, col2, ...])
(
    PARTITION partition_name1 VALUES [("k1-lower1", "k2-lower1", "k3-lower1",...), ("k1-upper1", "k2-upper1", "k3-upper1", ...)),
    PARTITION partition_name2 VALUES [("k1-lower1-2", "k2-lower1-2", ...), ("k1-upper1-2", MAXVALUE, ))
)

distribution_desc

Define the data bucketing method.

DISTRIBUTED BY HASH (k1[,k2 ...]) [BUCKETS num]
rollup_list

Multiple materialized views (ROLLUP) can be created while creating tables.

ROLLUP (rollup_definition[, rollup_definition, ...])
- rollup_definition
  
  rollup_name (col1[, col2, ...]) [DUPLICATE KEY(col1[, col2, ...])] [PROPERTIES("key" = "value")]
  
  Example:
```
ROLLUP (
    r1 (k1, k3, v1, v2),
    r2 (k1, v1)
)
```
properties

Set table properties. Currently the following properties are supported:
- replication_num
  
  Replication number. The default replication number is 3. If the number of BE node is less than 3, it is required to specify the replication number less than or equal to the number of BE node.
- storage_medium/storage_cooldown_time
  
  Data storage medium.storage_medium is used to declare the initial storage medium of table data, while storage_cooldown_time is used to set the expiration time. Example:
```
"storage_medium" = "SSD",
"storage_cooldown_time" = "2020-11-20 00:00:00"
```
  This example indicates that the data is stored in SSD and will be automatically migrated to HDD storage after it expires at 00: 00: 00 on 2020-11-20.
- colocate_with
  
  When it needs to use Colocation Join function, use this parameter to set Colocation Group.
  
  "colocate_with" = "group1"
- bloom_filter_columns
  
  Users specify the list of column names to be added with Bloom Filter index. The Bloom Filter index of each column is independent, not combined.
  
  "bloom_filter_columns" = "k1, k2, k3"
- in_memory
  
  Set whether the table is Memory Table through this property.
  
  "in_memory" = "true"
- function_column.sequence_type
  
  When using the UNIQUE KEY model, a sequence column can be specified. When the KEY columns are the same, REPLACE will be done by the sequence column (use the larger value to replace the smaller value, otherwise the replacement cannot be executed)
  
  Here we only need to specify the type of sequence columns, time type or integer type is supported. Palo will create a hidden sequence column.
  
  "function_column.sequence_type" = 'Date'
- Dynamic partition related
  
  Parameters related to dynamic partition are as follows:
  - dynamic_partition.enable: used to specify whether the dynamic partitioning function at table level is turned on. Default to true。
  - dynamic_partition.time_unit: Used to specify the time unit for dynamically adding partitions, which can be selected as DAY, WEEK and MONTH
  - dynamic_partition.start: Used to specify how many partitions are deleted forward. The value must be less than 0. Default to Integer.MIN_VALUE。
  - dynamic_partition.end: Used to specify the number of partitions created in advance. The value must be greater than 0.
  - dynamic_partition.prefix: Used to specify the prefix of the created partition name. for example, if the prefix of the partition name is p, the partition name will be automatically created as p20200108
  - dynamic_partition.buckets: Used to specify the number of partition and buckets automatically created

Example

Create a detailed model table

CREATE TABLE example_db.table_hash
(
    k1 TINYINT,
    k2 DECIMAL(10, 2) DEFAULT "10.5",
    k3 CHAR(10) COMMENT "string column",
    k4 INT NOT NULL DEFAULT "1" COMMENT "int column"
)
COMMENT "my first table"
DISTRIBUTED BY HASH(k1) BUCKETS 32

Create a detailed model table, partition and specify the ordering column, set replication number as 1

CREATE TABLE example_db.table_hash
(
    k1 DATE,
    k2 DECIMAL(10, 2) DEFAULT "10.5",
    k3 CHAR(10) COMMENT "string column",
    k4 INT NOT NULL DEFAULT "1" COMMENT "int column"
)
DUPLICATE KEY(k1, k2)
COMMENT "my first table"
PARTITION BY RANGE(k1)
(
    PARTITION p1 VALUES LESS THAN ("2020-02-01"),
    PARTITION p1 VALUES LESS THAN ("2020-03-01"),
    PARTITION p1 VALUES LESS THAN ("2020-04-01")
)
DISTRIBUTED BY HASH(k1) BUCKETS 32
PROPERTIES (
    "replication_num" = "1"
);

Create a unique key model table, set the initial storage medium and cooldown time

CREATE TABLE example_db.table_hash
(
    k1 BIGINT,
    k2 LARGEINT,
    v1 VARCHAR(2048) REPLACE,
    v2 SMALLINT SUM DEFAULT "10"
)
UNIQUE KEY(k1, k2)
DISTRIBUTED BY HASH (k1, k2) BUCKETS 32
PROPERTIES(
    "storage_medium" = "SSD",
    "storage_cooldown_time" = "2015-06-04 00:00:00"
);

Create an aggregate model table and describe by partition with fixed zone

CREATE TABLE table_range
(
    k1 DATE,
    k2 INT,
    k3 SMALLINT,
    v1 VARCHAR(2048) REPLACE,
    v2 INT SUM DEFAULT "1"
)
AGGREGATE KEY(k1, k2, k3)
PARTITION BY RANGE (k1, k2, k3)
(
    PARTITION p1 VALUES [("2014-01-01", "10", "200"), ("2014-01-01", "20", "300")),
    PARTITION p2 VALUES [("2014-06-01", "100", "200"), ("2014-07-01", "100", "300"))
)
DISTRIBUTED BY HASH(k2) BUCKETS 32

Create an aggregate model table containing HLL and BITMAP column types

CREATE TABLE example_db.example_table
(
    k1 TINYINT,
    k2 DECIMAL(10, 2) DEFAULT "10.5",
    v1 HLL HLL_UNION,
    v2 BITMAP BITMAP_UNION
)
ENGINE=olap
AGGREGATE KEY(k1, k2)
DISTRIBUTED BY HASH(k1) BUCKETS 32

Create two tables self-maintained by same Colocation Group.

CREATE TABLE t1 (
    id int(11) COMMENT "",
    value varchar(8) COMMENT ""
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES (
    "colocate_with" = "group1"
);

CREATE TABLE t2 (
    id int(11) COMMENT "",
    value1 varchar(8) COMMENT "",
    value2 varchar(8) COMMENT ""
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
    "colocate_with" = "group1"
);

Create a memory table with bitmap index and bloom filter index

CREATE TABLE example_db.table_hash
(
    k1 TINYINT,
    k2 DECIMAL(10, 2) DEFAULT "10.5",
    v1 CHAR(10) REPLACE,
    v2 INT SUM,
    INDEX k1_idx (k1) USING BITMAP COMMENT 'my first index'
)
AGGREGATE KEY(k1, k2)
DISTRIBUTED BY HASH(k1) BUCKETS 32
PROPERTIES (
    "bloom_filter_columns" = "k2",
    "in_memory" = "true"
);

Create a dynamic partition table.

Create partitions for three days in the table in advance every day, and delete partitions three days before. For example, today is 2020-01-08, then the partitions named as p20200108, p20200109, p20200110, p20200111 will be created. The ranges of such partitions are respectively:

[types: [DATE]; keys: [2020-01-08]; ‥types: [DATE]; keys: [2020-01-09]; )
[types: [DATE]; keys: [2020-01-09]; ‥types: [DATE]; keys: [2020-01-10]; )
[types: [DATE]; keys: [2020-01-10]; ‥types: [DATE]; keys: [2020-01-11]; )
[types: [DATE]; keys: [2020-01-11]; ‥types: [DATE]; keys: [2020-01-12]; )

CREATE TABLE example_db.dynamic_partition
(
    k1 DATE,
    k2 INT,
    k3 SMALLINT,
    v1 VARCHAR(2048),
    v2 DATETIME DEFAULT "2014-02-04 15:36:00"
)
DUPLICATE KEY(k1, k2, k3)
PARTITION BY RANGE (k1) ()
DISTRIBUTED BY HASH(k2) BUCKETS 32
PROPERTIES(
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "32" 
);

Create a table with materialized view (ROLLUP).

CREATE TABLE example_db.rolup_index_table
(
    event_day DATE,
    siteid INT DEFAULT '10',
    citycode SMALLINT,
    username VARCHAR(32) DEFAULT '',
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(event_day, siteid, citycode, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10
ROLLUP (
    r1(event_day,siteid),
    r2(event_day,citycode),
    r3(event_day)
)
PROPERTIES("replication_num" = "3");

Keyword

CREATE, TABLE

Best Practices

Partition and bucketing

In a table, a bucketing column must be specified, and a partition may not. Detailed introduction of partition and bucketing can be found in Data Division document.

Tables in Palo can be divided into partitioned tables and non-partitioned tables. This property is determined when the table is created, and cannot be changed afterwards. For partitioned tables, operations such as addition, deletion can be done during the process of use, while for non-partitioned tables, such operations are not allowed.

At the same time, the partition columns and bucketing columns cannot be changed after the table is created, neither the type of partition column and bucketing column can be changed, nor any addition or deletion operation can be performed on these columns.

So it is suggested to confirm the usage mode before building the table.

Dynamic Partition

The dynamic partition function is mainly used to help users manage partitions automatically. After set with related rules, Palo system regularly adds new partitions or deletes historical partitions. For more help, refer to Dynamic Partition document.

Materialized Views

Users can create multiple materialized views (ROLLUP) while building tables. Materialized views can also be added after the table is created. It can facilitate users to create all materialized views at one time by writing it in table building statements.

If materialized views are created during table building process, all subsequent data import operations will generate materialized view data synchronously. The number of materialized views may affect the efficiency of data import.

If materialized views are added during later use, if there is data in the table, the creation time of materialized views depends on the current data amount.

Introduction to materialized views can be found in Materialized Views.

Index

Users can create indexes of multiple columns while building tables. Indexes can also be added after the table is created.

When adding index in the later use, if there is data existing in the table, it is required to rewrite all the data, and the index creation time will be determined by the current data volume.

Memory Table

When the "in_memory" = "true" property is specified when building table, Palo will cache the data blocks of this table in the PageCache of the storage engine as far as possible to reduce the disk IO. However, this property cannot guarantee the data block is stored in the memory all the time but only works as an indicator of best-effort.

CREATE-FILE

CREATE-REPOSITORY

百度智能云

Data Warehouse