Migrate Data to a Hierarchical Namespace Bucket Using DistCp
Hadoop DistCp usage
DistCp is a tool built into Hadoop for large-scale file copying within and across clusters. It uses MapReduce for file distribution, error handling and recovery, and reporting: the list of files and directories is expanded into input for map tasks, and each task copies a partition of the files in the source list.
To copy data between HDFS clusters and BOS, you can use BOS HDFS with DistCp in the same way as the standard Hadoop DistCp.
Preparations
As mentioned earlier, download the BOS HDFS jar packages and complete the necessary configuration.
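As a minimal sketch, the setup amounts to placing the BOS HDFS jar on Hadoop's classpath and supplying credentials. The jar name, install path, endpoint, and the property names fs.bos.access.key, fs.bos.secret.access.key, and fs.bos.endpoint below are assumptions taken from the BOS HDFS configuration guide; these properties would normally live in core-site.xml but can also be passed per command:
# Sketch only: jar name, path, endpoint, and property names are assumptions
$ cp bos-hdfs-sdk-*.jar $HADOOP_HOME/share/hadoop/common/lib/
$ hadoop fs -Dfs.bos.access.key=<your-ak> -Dfs.bos.secret.access.key=<your-sk> -Dfs.bos.endpoint=bj.bcebos.com -ls bos://bucket/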
With an HDFS cluster as the data source and BOS as the destination, verify that both the source and the destination can be read and written properly:
$ hadoop fs -get hdfs://host:port/xxx ./
$ hadoop fs -put xxx bos://bucket/xxx
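If either command fails, a quick sanity check is a plain directory listing on each side (host, port, and bucket are placeholders to substitute):
# Placeholder paths; substitute your own namenode address and bucket name
$ hadoop fs -ls hdfs://host:port/
$ hadoop fs -ls bos://bucket/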
Start copying
Ordinary copy
# Copy src from HDFS to the dst path under the specified BOS bucket. By default, existing target files are skipped
$ hadoop distcp hdfs://host:port/src bos://bucket/dst
Note: To enable CRC verification of data before and after copying, BOS HDFS must be configured with fs.bos.block.size equal to the source HDFS block size and with fs.bos.crc32c.checksum.enable turned on. Only the HDFS dfs.checksum.combine.mode=COMPOSITE_CRC checksum algorithm is supported.
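As a sketch of that note, the three properties can be passed as Hadoop generic options on the DistCp command line instead of (or in addition to) core-site.xml. The 134217728 (128 MB) block size below is only an assumed value; match it to your actual source HDFS block size:
# Sketch only: the block size value is an assumption; match it to the source HDFS
$ hadoop distcp -Dfs.bos.block.size=134217728 -Dfs.bos.crc32c.checksum.enable=true -Ddfs.checksum.combine.mode=COMPOSITE_CRC hdfs://host:port/src bos://bucket/dst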
Update and overwrite
# Update copy: the only criterion for overwriting is file size. If the source and target sizes differ, the source file replaces the target file
$ hadoop distcp -update hdfs://host:port/src bos://bucket/dst
# Copy and overwrite existing target files unconditionally
$ hadoop distcp -overwrite hdfs://host:port/src bos://bucket/dst
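Standard DistCp also offers -delete, which works together with -update or -overwrite and removes destination files that no longer exist under the source. Whether this suits a BOS destination depends on your retention needs, so treat it as an optional, destructive extra:
# Standard DistCp option (destructive): delete dst files that are absent from src
$ hadoop distcp -update -delete hdfs://host:port/src bos://bucket/dst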
Copy multiple sources
# Specify multiple source paths
$ hadoop distcp hdfs://host:port/src1 hdfs://host:port/src2 bos://bucket/dst
# Obtain multiple sources from a file with -f
# srcList contains one source path per line, e.g. hdfs://host:port/src1 and hdfs://host:port/src2
$ hadoop distcp -f hdfs://host:port/srcList bos://bucket/dst
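A hypothetical way to build and stage the srcList file referenced above:
# Hypothetical helper: write one source path per line, then stage the list in HDFS
$ printf 'hdfs://host:port/src1\nhdfs://host:port/src2\n' > srcList
$ hadoop fs -put srcList hdfs://host:port/srcList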
Note: Copying from multiple sources may produce path conflicts. Refer to the standard Hadoop DistCp documentation for how such conflicts are resolved.
More configuration
# Specify the number of maps used for copying
# Increasing the number of maps does not necessarily improve throughput and can cause problems of its own; set it according to cluster resources and the volume of data to copy
$ hadoop distcp -m 10 hdfs://host:port/src bos://bucket/dst
# Ignore failed maps, but keep logs of the failed operations
$ hadoop distcp -i hdfs://host:port/src bos://bucket/dst
# Dynamic task allocation
# The default strategy partitions work by total file size. During an update copy, skipped files skew the actual copy work across maps ("data skew"), and the slowest map delays overall progress; the dynamic strategy lets faster maps pick up the remaining work
$ hadoop distcp -strategy dynamic -update hdfs://host:port/src bos://bucket/dst
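Another standard DistCp option worth knowing when copying into BOS over a shared network is per-map bandwidth throttling; the 50 MB/s figure below is only an illustrative value:
# Limit each map to roughly 50 MB/s (illustrative value) to avoid saturating the link
$ hadoop distcp -bandwidth 50 hdfs://host:port/src bos://bucket/dst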
For more configuration parameters, run the hadoop distcp command without arguments to print its help information.
