Index

Last Updated：2021-04-13

Index is used to help filter or find data quickly.

Currently, Palo mainly supports two kinds of indexes: built-in intelligent index, including prefix index and ZoneMap index. Secondary indexes created by users include Bloom Filter index and Bitmap inverted index.

ZoneMap index is the index information automatically maintained for each columns in column storage format, including Min / Max, number of Null values, etc. This kind of index is transparent to users and is not described here. The following mainly introduces the other three types of indexes.

Prefix index

Principles

In essence, Palo's data are stored in a data structure similar to SStable (Sorted String Table), which is an ordered data structure and can be sorted and stored according to the specified columns. In this data structure, it will be very efficient to search by arranging sequence.

The prefix index, which is based on sorting, is an index method to query data quickly according to the given prefix columns.

Prefix index is a sparse index created with the granularity of Block, a block contains 1024 rows of data, and for each Block, the value of the prefix column of the first row of data of the Block is used as the index.

We use the first 36 bytes of a row of data as the prefix index of this row of data. When a VARCHAR type is encountered, the prefix index is truncated directly. Here are the examples:

The prefix index for the following table structure is user_id(8Byte) + age(4Bytes) + message(prefix 20 Bytes).

ColumnName	Type
user_id	BIGINT
age	INT
message	VARCHAR(100)
max_dwell_time	DATETIME
min_dwell_time	DATETIME

The prefix index for the following table structure is user user_name(20 Bytes). Because VARCHAR is encountered, it will be truncated directly and will not continue in the future even if it does not reach 36 bytes.

ColumnName	Type
user_name	VARCHAR(20)
age	INT
message	VARCHAR(100)
max_dwell_time	DATETIME
min_dwell_time	DATETIME

Usage scenarios

When our query condition is the prefix of prefix index, the query speed can be greatly increased. Now we execute the following query in the first example:

SELECT * FROM table WHERE user_id=1829239 and age=20；

The efficiency of this query is much higher than that of the following query:

SELECT * FROM table WHERE age=20；

Therefore, when creating a table, the correct selection of column order can greatly improve the query efficiency.

Bloom Filter index

Principles

When creating a table, users can specify to create Bloom Filter index on some columns (hereinafter referred to as BF index). We can also add a BF index by using ALTER TABLE command at run time.

Bloom Filter is essentially a bitmap structure used to quickly determine whether a given value is in a set. This kind of judgment will produce a small probability of misjudgment. That is, if False is returned, it must not be in this set. And if the range is True, it may be in this set.

BF index is also created for granularity by Block. In each Block, the value of the specified column is used as a set to generate a BF index entry, which is used to quickly filter the data that does not meet the conditions in the query.

Applicable scenarios

Due to the characteristics of Bloom Filter data structure, BF index is more suitable to be created on columns with high cardinality, such as UserID. If it is created on a low cardinality column, such as the "sex"column, then every Block will contain almost all values, making BF index meaningless.

Bitmap index

Principles

When creating a table, users can specify to create a Bitmap index on some columns. Also, users can add a new Bitmap index by ALTER TABLE command at run time.

Bitmap index is a special database index technology, which uses bit array (or bitmap, bit set, bit string, bit vector) to store and calculate. Each bit in the position code indicates the existence of the data row corresponding to the key value. A bitmap may point to dozens or even hundreds of rows of data.

Compared with the B * tree index, this way of storing data takes up very little space and is very fast to create and use. When querying according to the key value, users can quickly locate the specific row number according to Bitmap index. When we do and / or or in (x, y,..) query according to the key value, we directly do so or operate through the bitmap of the index to quickly get the result row data.

There are the following restrictions for Bitmap index in Palo:

Bitmap index is created only on a single column
The data types supported by bitmap index are as follows:
- TINYINT
- SMALLINT
- INT
- UNSIGNEDINT
- BIGINT
- CHAR
- VARCHAE
- DATE
- DATETIME
- LARGEINT
- DECIMAL
- BOOL

Applicable scenarios

It is suitable for columns with low cardinality and It is recommended to be between 100 and 100000, such as occupation, city, etc. If the cardinality is too high, there is no obvious advantage; If the cardinality is too low, the space efficiency and performance will be greatly reduced.
For certain types of queries such as count, or, and as well as other logical operations, only bit operations are needed. Through multiple condition combination query scenarios like select count(*) from table where city = 'beijing' and job = 'teacher', if a bitmap index is established on each query condition column, efficient bit operation can be performed to accurately locate the required data and scan the data. The smaller the filtered result set is, the more obvious the advantage of bitmap index will be.

Data Model

Operating Manual

百度智能云

Data Warehouse

Index

Prefix index

Principles

Usage scenarios

Bloom Filter index

Principles

Applicable scenarios

Bitmap index

Principles

Applicable scenarios