Backup and Recovery

Last Updated: 2021-04-13

The backup and restore feature is used to quickly back up a cluster snapshot to remote storage and to quickly recover from the backup data when needed.

Backup differs from the data export function. Because it copies data files directly to remote storage, backup is faster overall than export. However, backup data can only be used by Palo's own restore function, whereas exported data can be read and used by other systems.

Basic Concepts

Repository

Before backing up or restoring, the user must create a Repository, which maps a directory on a remote storage system into Palo. The backup operation uploads data to this path, and the restore operation downloads data from this path.

Palo supports creating and deleting repositories. Refer to the CREATE REPOSITORY and DROP REPOSITORY command manuals for details. Use the SHOW REPOSITORIES command to view the repositories that have been created.
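
As a minimal sketch of the management commands mentioned above, the statements below list the existing repositories and then drop one. The repository name `example_repo` is a placeholder, and the exact syntax should be checked against the command manuals.

```sql
-- List all repositories that have been created in this cluster.
SHOW REPOSITORIES;

-- Drop the repository mapping named example_repo (placeholder name).
DROP REPOSITORY `example_repo`;
```
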
Backup

A backup operation uploads data, with the partition as the minimum granularity, directly to the remote repository in the form of Palo files. After the user submits a backup request, the system performs the following operations:
1. Take and upload the snapshot
  
  During the snapshot phase, the system takes a snapshot of the data files of the specified tables or partitions, and the backup then operates on that snapshot. Once the snapshot is taken, later modifications, imports, and other operations on the table no longer affect the backup result. A snapshot only creates hard links to the current data files, so it takes very little time. After the snapshot completes, the snapshot files are uploaded one by one, and the Compute Nodes perform the upload concurrently.
2. Preparation and upload of metadata
  
  After the data file snapshots are uploaded, the Leader Node first writes the corresponding metadata to a local file and then uploads that file to the remote repository, which completes the backup job.
Refer to the BACKUP syntax manual for the specific backup operations. Backup is an asynchronous operation, and its progress can be viewed with the SHOW BACKUP command. A running backup operation can be canceled with the CANCEL BACKUP command.
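
A minimal sketch of monitoring and canceling a backup job, assuming the job was submitted in a database named `example_db` (a placeholder) and that both statements select the database with `FROM`, as described in the command manuals:

```sql
-- Show the backup job of example_db; the State column reports its progress.
SHOW BACKUP FROM example_db;

-- Cancel the backup job that is currently running in example_db.
CANCEL BACKUP FROM example_db;
```
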
Restore

A restore operation specifies an existing backup in the remote repository and restores its contents to the local cluster. After the user submits a restore request, the system performs the following operations:
1. Create the corresponding metadata locally
  
  In this step, the corresponding tables, partitions, and other structures are created in the local cluster. After creation, the table is visible but not accessible.
2. Download snapshot
  
  Download the snapshot file from the remote repository to the corresponding Compute Node.
3. Make the snapshot take effect
  
  After the snapshots are downloaded, each snapshot is mapped to the metadata of the current local table. The snapshots are then reloaded so that they take effect, which completes the restore job.
Refer to the RESTORE syntax manual for the specific restore operations. Restore is an asynchronous operation, and its progress can be viewed with the SHOW RESTORE command. A running restore operation can be canceled with the CANCEL RESTORE command.
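
The restore job can be monitored and canceled in the same way as a backup job. A minimal sketch, again assuming a placeholder database `example_db` and the `FROM` form described in the command manuals:

```sql
-- Show the restore job of example_db; the State column reports its progress.
SHOW RESTORE FROM example_db;

-- Cancel the restore job that is currently running in example_db.
CANCEL RESTORE FROM example_db;
```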

Operation Examples

The following complete example shows how to migrate data from cluster A to cluster B through backup and restore.

Create a repository in cluster A

```
CREATE REPOSITORY `bos_repo`
WITH BROKER `bos`
ON LOCATION "bos://my_bucket/palo_backup"
PROPERTIES
(
    "bos_endpoint" = "http://bj.bcebos.com",
    "bos_accesskey" = "xxxxxxxxxxxxxxxxxx",
    "bos_secret_accesskey" = "yyyyyyyyyyyyyyy"
);
```

This creates a repository named bos_repo that points to the palo_backup directory. Refer to CREATE REPOSITORY for more detailed help.

Backup data in cluster A
```
BACKUP SNAPSHOT example_db.snapshot1
TO `bos_repo`
ON
(
    example_tbl PARTITION (p1,p2),
    example_tbl2
);
```
This statement backs up partitions p1 and p2 of table example_tbl and the whole of table example_tbl2, both in database example_db, to repository bos_repo under the backup name snapshot1. Refer to BACKUP for more detailed help on the backup operation.

Backup is an asynchronous operation, and its progress can be viewed with the SHOW BACKUP command. When the State field in the returned result is FINISHED, the backup is complete.

Create an identical repository in cluster B:

```
CREATE REPOSITORY `bos_repo`
WITH BROKER `bos`
ON LOCATION "bos://my_bucket/palo_backup"
PROPERTIES
(
    "bos_endpoint" = "http://bj.bcebos.com",
    "bos_accesskey" = "xxxxxxxxxxxxxxxxxx",
    "bos_secret_accesskey" = "yyyyyyyyyyyyyyy"
);
```

View the backup snapshot of the repository in cluster B
```
SHOW SNAPSHOT ON `bos_repo`;
```
Refer to SHOW SNAPSHOT syntax manual for more help.
Restore data in cluster B
```
RESTORE SNAPSHOT example_db.`snapshot1`
FROM `bos_repo`
ON
(
    `example_tbl2`
)
PROPERTIES
(
    "backup_timestamp"="2020-05-04-16-45-08",
    "replication_num" = "1"
);
```
This statement selects the backup named snapshot1 in bos_repo and restores only table example_tbl2 from it.

Each backup has a timestamp (backup_timestamp), which must be specified explicitly when restoring. Here, replication_num is set to 1, so only one replica is restored.

Restore is also an asynchronous operation, and its progress can be viewed with the SHOW RESTORE command. When the State field in the returned result is FINISHED, the restore is complete.
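
To find the timestamp of a particular backup, the snapshot listing can be filtered by name. A minimal sketch, assuming the `WHERE SNAPSHOT = ...` filter described in the SHOW SNAPSHOT manual is available in this version:

```sql
-- List the entries for snapshot1 in bos_repo. The timestamp shown in the
-- result supplies the value for "backup_timestamp" in the RESTORE statement.
SHOW SNAPSHOT ON `bos_repo` WHERE SNAPSHOT = "snapshot1";
```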

Best Practice

Backup

Currently, only full backup is supported, with the partition as the minimum granularity. If data needs to be backed up regularly, first plan the table's partitions and buckets sensibly, for example by partitioning by time. Regular backups can then be run at partition granularity during later operation.

Backing up partition by partition in this way achieves the effect of incremental backup.
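
As a sketch of this approach, assume example_tbl is partitioned by day and a new partition, hypothetically named `p20200505`, has just been filled. Backing up only that partition under its own snapshot name keeps each regular backup small. The partition and snapshot names below are illustrative only.

```sql
-- Back up only the newest partition of example_tbl to the repository.
BACKUP SNAPSHOT example_db.snapshot_p20200505
TO `bos_repo`
ON
(
    example_tbl PARTITION (p20200505)
);
```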

Data Migration

To migrate data, the user can first back up the data to the remote repository and then restore it to another cluster from that repository. Because a backup is taken as a snapshot, data imported after the snapshot phase of the backup job is not backed up. Therefore, any data imported into the original cluster between the completion of the snapshot and the completion of the restore job must also be imported into the new cluster.

It is recommended to import data into both the old and new clusters in parallel for a period of time after the migration. Once the data and the services have been verified to be correct, migrate the services to the new cluster.
