
          Use DTS to Migrate Data with Elasticsearch as the Destination

          1. Applicable Scenarios

          This article describes how to use Baidu AI Cloud Data Transmission Service (hereinafter referred to as DTS) to migrate data from any upstream data source supported by DTS (MySQL, Oracle, etc.) to a destination Elasticsearch instance.

          2. Usage Restrictions

          • Structure migration is not supported.
          • Incremental synchronization does not support synchronizing DDL statements from relational databases.
          • When the source data source is MySQL, fields of type enum/set/timestamp in the source table are transmitted downstream as binary data; you need to interpret them according to the upstream MySQL table structure.
          • Synchronizing binary-type data from the source data source is not supported.

          3. Migration Preconditions

          3.1 Environmental Requirements

          The data source instance serving as the migration source has been created, for example: RDS for MySQL, self-built MySQL, or self-built Oracle.

          The Elasticsearch instance serving as the migration destination has been created, for example: a Baidu Elasticsearch instance or a self-built Elasticsearch instance. DTS supports Elasticsearch 5.0 and above.

          3.2 Privilege Requirements

          Source: The account only needs to meet the minimum source privilege requirements for the structure migration, full migration, incremental migration, and data verification stages. For details, please refer to the best practices for data source migration to the cloud.

          Destination: The account should have the privilege to create indexes and write data in the Elasticsearch instance.

          Currently, DTS does not support structure migration. If you choose full/incremental migration, Elasticsearch will automatically create an index based on the format of the written data and infer the corresponding index mapping. However, the field types that Elasticsearch infers on its own may not meet your actual usage requirements, so relying on this behavior is not recommended.

          We recommend that, before starting the DTS data migration task, you create the relevant indexes in the destination Elasticsearch instance and specify the type of each field according to your actual query requirements.
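
          For example, here is a minimal sketch of pre-creating an index with an explicit mapping. The index name dtstest_tbl1 and the fields below are hypothetical; adjust them to your own table structure and query needs. Note that this request uses the Elasticsearch 7.x mapping format; on 5.x/6.x an additional document type level is required under "mappings".

          PUT /dtstest_tbl1 
          { 
              "mappings" : { 
                  "properties" : { 
                      "id" : { "type" : "long" }, 
                      "name" : { "type" : "keyword" }, 
                      "created_at" : { "type" : "date" } 
                  } 
              } 
          } 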

          In addition, to avoid excessive load on the downstream Elasticsearch instance caused by massive data writing, we recommend configuring the index setting refresh_interval to a relatively large value. For example, if it is acceptable for newly written data to become visible one minute after it is written, you can set refresh_interval to 1m:

          PUT /indice/_settings 
          { 
              "index" : { 
                  "refresh_interval" : "1m" 
              } 
          } 
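
          After the full migration finishes, you can consider reverting this setting to its default (1s) by setting it to null:

          PUT /indice/_settings 
          { 
              "index" : { 
                  "refresh_interval" : null 
              } 
          } 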

          4. Operation Steps

          4.1 Create a Migration Task

          Enter the DTS list page and click [Create Data Transmission Instance].

          After configuring the basic information of the instance, pay for the DTS instance.

          4.2 Configure the Task

          After returning to the DTS list page, you can see the DTS instance you just purchased; its status is [Not configured].

          Click [Configure Task].

          4.2.1 Connection Configuration

          First, configure the upstream and downstream connection information of the data transmission task. You can choose a cloud database instance or self-built instance as needed.

          As shown in the figure, in the source connection information you can select different data sources as the source of the task, including RDS for MySQL, self-built MySQL, self-built Oracle, etc. In the figure, an RDS for MySQL instance in the East China-Suzhou region is selected.

          The destination should be an Elasticsearch instance. Currently, DTS supports the managed Baidu Elasticsearch product and self-built Elasticsearch instances as destinations. In the figure, a Baidu Elasticsearch instance in the North China-Baoding region is selected. The configuration parameters of the Elasticsearch destination are described as follows:

          • Access type: Supports public network/BCC/BBC/DCC self-built Elasticsearch instances and Baidu Elasticsearch instances.
          • Type of data: Fixed as Elasticsearch.
          • Region: The Baidu Cloud logical region where the self-built instance/Baidu Elasticsearch instance is located. No region needs to be selected when the access type is public network.
          • Instance ID: When the access type is BCC/BBC/DCC, this is the BCC/BBC/DCC instance ID; when the access type is Baidu Elasticsearch, it is the Baidu Elasticsearch instance ID. When the access type is public network, you do not need to select an instance ID, but you do need to fill in the instance IP.
          • Port: The access port of the Elasticsearch instance. Baidu Elasticsearch instances do not need this field.

          If your Elasticsearch instance is a self-built cluster that contains multiple nodes, you only need to fill in the connection information of a single node in the cluster when configuring the task. Using the master node is recommended.

          • Account authentication: When the access type is Baidu Elasticsearch, the account and password are required by default and can be found on the Baidu Elasticsearch instance details page. When the access type is public network/BCC/BBC/DCC, you can decide whether to fill in account authentication information according to the actual configuration of the self-built Elasticsearch instance (a connectivity check is sketched after this list).
          • Insertion method: This field belongs to the data view feature; its configuration is described in [Data View] -> [How to Configure Tasks] (5.3.1 Connection Configuration) below. For a normal data migration task, select the task default.
          • Specify the primary key: This field belongs to the data view feature; its configuration is described in [Data View] -> [How to Configure Tasks] (5.3.1 Connection Configuration) below. It does not need to be configured for a normal data migration task.
          • Load balance: Can only be selected when the access type is public network/BCC/BBC/DCC self-built. If load balancing is enabled, DTS queries and maintains the global topology of the Elasticsearch cluster during the data migration and randomly sends write requests to any surviving node in the cluster.
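
          Before configuring the task, you can optionally verify that a self-built instance is reachable and that the account information is correct, for example with curl (the host, port, account, and password below are placeholders):

          curl -u elastic:your_password http://192.168.0.1:9200 

          If the connection and credentials are valid, Elasticsearch returns a JSON document containing the cluster name and version.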

          After completing the configuration, click [Authorize Whitelist and Go to the Next Step] to enter the object mapping configuration page.

          4.2.2 Object Mapping

          As shown in the figure above, the data flow is Cloud database RDS (MySQL) -> Baidu Elasticsearch (Elasticsearch), and the migration type is full migration + incremental synchronization. For the transfer object, you can either migrate the entire source instance or manually select the migration objects. In the figure, the three tables multi_pk_tbl1, multi_pk_tbl2, and multi_pk_tbl3 under the dtstest library are selected as the migration objects.

          The selected migration objects appear in the selected object list on the right. DTS supports upstream/downstream library and table name mapping, row filtering, column filtering black/white lists, and other functions. You can click [Edit] to configure mapping and filtering rules for each migration object.

          In the figure above, the sample task configures table name mapping rules for multi_pk_tbl1 and multi_pk_tbl2, changing the destination table names to tbl1 and tbl2, respectively. You can also configure row filtering rules and column filtering black/white lists as needed.

          The mapping rule for DTS data migration is that one table at the source corresponds to one index in Elasticsearch, and the index name is: mapped library name_mapped table name. For example, for the dtstest.multi_pk_tbl1 table, after the table name mapping rule multi_pk_tbl1->tbl1 is configured, the corresponding downstream index name is dtstest_tbl1. Similarly, the downstream index corresponding to the dtstest.multi_pk_tbl2 table is named dtstest_tbl2. In the figure, the dtstest.multi_pk_tbl3 table has no table name mapping configured, so its table name after mapping is still multi_pk_tbl3 and the corresponding downstream index name is dtstest_multi_pk_tbl3.

          If you choose library-level migration, you can only configure library name mapping, not table name mapping. The downstream index name is: mapped library name_table name. For example: there is a table tbl1 under the test library, the task migrates the test library, and the library name mapping is configured as test->alpha. The newly created index in the downstream Elasticsearch instance is then named alpha_tbl1.

          In addition, Elasticsearch requires index names to be all lowercase, so DTS converts the upstream library and table names to all-lowercase index names by default. For example: the downstream Elasticsearch index corresponding to the upstream dtsTEST.TESTTABLE table is named dtstest_testtable.
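
          Once the task is running, you can confirm that the indexes created downstream follow these naming rules, for example (the dtstest_ prefix matches the example library name above):

          GET _cat/indices/dtstest_*?v 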

          After completing the object mapping configuration, click [Save and pre-check] to start the pre-check of the task.

          4.3 Pre-check and Start Migration

          You can view the execution result of the pre-check on the DTS instance list page. If the check fails, you can click the eye-shaped icon next to the task's main status to view the pre-check details and the failure reason of each failed check item.

          After correcting the problems as recommended by DTS, run the pre-check again until it passes, then click [Start] on the right to start the task.

          4.4 Check Elasticsearch Instance Data

          You can use an Elasticsearch client or the REST interface to check the data during the migration process. For example, when the destination Elasticsearch version is 7.6, query the document whose _id is 0 in the twitter index:

          GET twitter/_doc/0 
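
          You can also spot-check migration progress by comparing row counts between a source table and the corresponding destination index, for example (dtstest_tbl1 is the hypothetical index from the mapping example above):

          GET dtstest_tbl1/_count 

          Note that because of the refresh_interval setting recommended earlier, the count may lag the latest writes by up to the configured interval.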

          5. Data View

          5.1 Business Scenario

          The data view feature combines data from multiple entity tables in the source data source into one view and saves it in a single index of the destination Elasticsearch instance.

          As shown in the figure above, there are three entity tables in the source data source: tbl_name, tbl_age, and tbl_address. The primary keys of the three tables are identical (all are pk_id), but the other fields differ. DTS supports creating a data migration task that synchronizes data from multiple entity tables at the source into one or more indexes at the destination. The task uses pk_id as the index's specified primary key and supports synchronizing both full and incremental data.
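
          As a sketch of the merge semantics (all field names and values below are hypothetical, derived from the three example tables): after DTS upserts the rows with pk_id = 1 from tbl_name, tbl_age, and tbl_address, the single document with _id 1 in the destination index holds the union of their columns:

          GET dtstest_view/_doc/1 

          Response (trimmed): 
          { 
              "_id" : "1", 
              "_source" : { 
                  "pk_id" : 1, 
                  "name" : "Alice", 
                  "age" : 30, 
                  "address" : "Beijing" 
              } 
          } 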

          5.2 Usage Restrictions

          • All tables in the same view must contain the primary key specified for the index. As shown in the figure above, the three tables tbl_name, tbl_age, and tbl_address all contain the specified primary key column pk_id.
          • The consistency of the source data view is guaranteed by the user. When adding or deleting data in the tables of the same view, ensure atomicity as far as possible.
          • Modifying the specified primary key column of the data view is not supported; doing so will cause the task to fail.

          5.3 How to Configure Tasks

          5.3.1 Connection Configuration

          Returning to the connection configuration page of the task configuration: the difference between a data view task and a normal migration task is that two additional fields need to be configured.

          • Insertion method: Fixed as upsert.
          • Specify the primary key: You need to specify the primary key column(s) of each data view. The format is: downstream index name 1[primary key column]; downstream index name 2[primary key column 1, primary key column 2].

          Take the example in [5.1 Business Scenario] above: the primary key column of the data view composed of the three tables tbl_name, tbl_age, and tbl_address is pk_id, and the Elasticsearch index corresponding to the source data view is named dtstest_view, so the specified primary key should be configured as dtstest_view[pk_id].

          In addition, synchronization rules for multiple data views can be configured in one data migration task. For example: a task is intended to perform full + incremental migration of two data views, dtstest_view1 (primary key column id) and dtstest_view2 (primary key columns user_id and name). The specified primary key can then be configured as dtstest_view1[id];dtstest_view2[user_id,name]. Use semicolons (;) to separate different views and commas (,) to separate different primary key columns of the same view.

          5.3.2 Object Mapping

          On the object mapping configuration page, the difference between a data view task and a normal migration task is that the entity tables of the same data view must be configured with table name mapping, so that the full data and incremental change records of multiple source tables can be written to the same Elasticsearch index at the destination.

          As shown in the figure below, the three tables multipktbl1, multipktbl2, and multipktbl3 belong to the same data view, so the destination table name of all three tables is configured as view. The resulting destination Elasticsearch index name is: mapped library name_mapped table name (dtstest_view). This index name is consistent with the index name in the specified primary key configuration (dtstest_view[pk_id]) from the previous step.

          Other configurations are the same as normal data migration tasks.
