Configuration and Usage
To use the BOS HDFS tool, you only need to download the corresponding SDK package and modify a few configurations.
Download
- Download the BOS FS JAR and copy the unzipped JAR package to $hadoop_dir/share/hadoop/common.
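For example, a minimal sketch of these steps (the archive and JAR names are assumptions and depend on the release you download):

# Unzip the downloaded package and copy the JAR into Hadoop's common library directory
$ unzip bos-hdfs-sdk.zip -d bos-hdfs-sdk
$ cp bos-hdfs-sdk/*.jar $hadoop_dir/share/hadoop/common/
# Optionally confirm that the JAR is visible on the Hadoop classpath
$ hadoop classpath --glob | tr ':' '\n' | grep -i bos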
Preparation before use
- Modify log4j.properties in the Hadoop configuration path to adjust the log configuration of the BOS SDK: log4j.logger.com.baidubce.http=WARN
- Add or modify the BOS HDFS-related configurations in the $hadoop_dir/etc/hadoop/core-site.xml file.
<property>
    <name>fs.bos.access.key</name>
    <value>{Your AK}</value>
</property>
<property>
    <name>fs.bos.secret.access.key</name>
    <value>{Your SK}</value>
</property>
<property>
    <name>fs.bos.endpoint</name>
    <value>http://bj.bcebos.com</value>
</property>
<property>
    <name>fs.bos.impl</name>
    <value>org.apache.hadoop.fs.bos.BaiduBosFileSystem</value>
</property>
<property>
    <name>fs.AbstractFileSystem.bos.impl</name>
    <value>org.apache.hadoop.fs.bos.BOS</value>
</property>
Configurable attributes:
| Name | Description |
|---|---|
| fs.bos.access.key | Required: AccessKey for BOS. |
| fs.bos.secret.access.key | Required: SecretKey for BOS. |
| fs.bos.endpoint | Required: The endpoint where the BOS bucket is hosted. |
| fs.bos.session.token.key | Optional: Temporary access mode. If configured, fs.bos.access.key and fs.bos.secret.access.key should also be temporary access keys. |
| fs.bos.credentials.provider | Optional: Credentials access mode. Configurable values include: org.apache.hadoop.fs.bos.credentials.BasicBOSCredentialsProvider, org.apache.hadoop.fs.bos.credentials.EnvironmentVariableCredentialsProvider. The default value is org.apache.hadoop.fs.bos.credentials.BasicBOSCredentialsProvider. |
| fs.bos.bucket.{your bucket}.access.key | Optional: Specify the BOS AccessKey for {your bucket}. This setting will take priority when accessing {your bucket} resources. |
| fs.bos.bucket.{your bucket}.secret.access.key | Optional: Specify the BOS SecretKey for {your bucket}. This setting will take priority when accessing {your bucket} resources. |
| fs.bos.bucket.{your bucket}.session.token.key | Optional: Specify the temporary STS access mode for {your bucket}. If configured, fs.bos.bucket.{your bucket}.access.key and fs.bos.bucket.{your bucket}.secret.access.key should also be temporary access keys. This setting will take priority when accessing {your bucket} resources. |
| fs.bos.bucket.{your bucket}.endpoint | Optional: Specify the endpoint of the BOS bucket where {your bucket} is located. This setting will take priority when accessing {your bucket} resources. |
| fs.bos.block.size | Optional: Simulate HDFS block size, with a default of 128 MB; this is useful for scenarios like DistCp that require block information validation. |
| fs.bos.max.connections | Optional: Set the maximum number of client-supported connections, with a default value of 1000. |
| fs.bos.multipart.uploads.block.size | Optional: Part size used by the client for multipart uploads; the default is 16,777,216 bytes (16 MB). |
| fs.bos.multipart.uploads.concurrent.size | Optional: The number of concurrent multipart uploads supported by the client, with a default value of 10. |
| fs.bos.object.dir.showtime | Optional: Display the modification time of directories while listing. Enabling this increases the number of interactions; the default is false. |
| fs.bos.tail.cache.enable | Optional: Cache part of the data at the end of files to enhance resolution performance for file formats like Parquet/ORC; the default is true. |
| fs.bos.crc32c.checksum.enable | Optional: Enable CRC32C checksum verification. When enabled, CRC32C will be computed for uploaded files. |
| fs.bos.bucket.hierarchy | Optional: Whether the bucket is a hierarchical-namespace bucket. A correct setting saves a verification step, while an incorrect one causes bucket access failures. If left at the default, this is determined automatically. |
| fs.bos.rename.enable | Optional: Enable metadata rename semantics. This feature is currently in a small-scale closed beta; you must submit a ticket to confirm usage, otherwise requests will return a 403 authorization failure. The default is false. |
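For example, the per-bucket overrides can be added to core-site.xml in the same way as the global settings, or passed on the command line through Hadoop's generic -D options. In the sketch below, my-bucket, its AK/SK, and the gz.bcebos.com endpoint are placeholder values:

# Illustrative only: my-bucket, its AK/SK, and the endpoint are placeholders
$ hdfs dfs -D fs.bos.bucket.my-bucket.access.key={AK for my-bucket} -D fs.bos.bucket.my-bucket.secret.access.key={SK for my-bucket} -D fs.bos.bucket.my-bucket.endpoint=http://gz.bcebos.com -ls bos://my-bucket/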
If using EnvironmentVariableCredentialsProvider, you need to set environment variables:
# Required: AccessKey for BOS
FS_BOS_ACCESS_KEY
# Required: SecretKey for BOS
FS_BOS_SECRET_ACCESS_KEY
# Optional: temporary access mode (STS session token)
FS_BOS_SESSION_TOKEN_KEY
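A minimal sketch of this mode (the values are placeholders); it assumes fs.bos.credentials.provider is set to org.apache.hadoop.fs.bos.credentials.EnvironmentVariableCredentialsProvider in core-site.xml:

# Placeholder values; export the credentials in the environment that runs the Hadoop commands
$ export FS_BOS_ACCESS_KEY={Your AK}
$ export FS_BOS_SECRET_ACCESS_KEY={Your SK}
# Only needed when using temporary (STS) credentials
$ export FS_BOS_SESSION_TOKEN_KEY={Your session token}
$ hdfs dfs -ls bos://{bucket}/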
Start using
When accessing BOS services using BOS HDFS, the path must start with bos://. For example:
$ hdfs dfs -ls bos://{bucket}/
$ hdfs dfs -put ${local_file} bos://{bucket}/a/b/c
Alternatively, configure the default file system in $hadoop_dir/etc/hadoop/core-site.xml:
<property>
    <name>fs.defaultFS</name>
    <value>bos://{bucket}</value>
</property>
Note: When fs.defaultFS is set to the BOS file system, starting the NameNode and DataNode may fail with a scheme check error.
It is therefore advised to configure fs.defaultFS this way only if you use BosFileSystem and do not need to start the NameNode and DataNode; otherwise, keep fs.defaultFS pointing to the default HDFS address.
Just like using native HDFS:
$ hdfs dfs -ls /
A WordCount example
1. Create the data directories
Create a directory to store the input files of the MapReduce job:
$ hdfs dfs -mkdir -p bos://test-bucket/data/wordcount
Create a directory to store the output files of the MapReduce job:
$ hdfs dfs -mkdir bos://test-bucket/output
View the two newly created directories:
$ hdfs dfs -ls bos://test-bucket/
Found 2 items
drwxrwxrwx - 0 1970-01-01 08:00 bos://test-bucket/data
drwxrwxrwx - 0 1970-01-01 08:00 bos://test-bucket/output
If you want directory listings to display the actual modification time, configure the following in $hadoop_dir/etc/hadoop/core-site.xml:
<property>
    <name>fs.bos.object.dir.showtime</name>
    <value>true</value>
</property>
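After this change takes effect, listing the bucket again should show the directories' real modification times rather than the 1970-01-01 placeholder:

$ hdfs dfs -ls bos://test-bucket/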
2. Write a word file and upload it to HDFS
Content of the word file:
$ cat words.txt
hello baidu
hello bos
hadoop hdfs
hello hadoop
bos hdfs
Upload words.txt to HDFS:
$ hdfs dfs -put words.txt bos://test-bucket/data/wordcount
Check the uploaded file in HDFS:
$ hdfs dfs -cat bos://test-bucket/data/wordcount/words.txt
hello baidu
hello bos
hadoop hdfs
hello hadoop
bos hdfs
3. Run the wordcount program
The built-in wordcount program of Hadoop is located in $hadoop_dir/share/hadoop/mapreduce/.
$ hadoop jar hadoop-mapreduce-examples-2.7.7.jar wordcount bos://test-bucket/data/wordcount bos://test-bucket/output/wordcount
4. View the word count results
$ hdfs dfs -ls bos://test-bucket/output/wordcount/
-rw-rw-rw- 1 0 2020-06-12 16:55 bos://test-bucket/output/wordcount/_SUCCESS
-rw-rw-rw- 1 61 2020-06-12 16:55 bos://test-bucket/output/wordcount/part-r-00000
$ hdfs dfs -cat bos://test-bucket/output/wordcount/part-r-00000
baidu 1
bos 2
hadoop 2
hdfs 2
hello 3
Usage of Hadoop DistCp
DistCp (distributed copy) is a built-in Hadoop tool designed for large-scale data transfers between and within clusters. It uses Map/Reduce for file distribution, error handling, recovery, and report generation. DistCp processes a list of files and directories as input and assigns map tasks, with each task handling a portion of the files from the source list.
With the BOS HDFS tool, you can copy data between HDFS clusters and BOS just as you would with standard Hadoop DistCp.
Preparations
As explained above, download the BOS HDFS-related JAR files and make the necessary configurations.
Taking an HDFS cluster as the source and BOS as the destination as an example, first verify that both the source and destination can be read and written as expected:
$ hadoop fs -get hdfs://host:port/xxx ./
$ hadoop fs -put xxx bos://bucket/xxx
Start copying
Ordinary copy
# Copy the src path on HDFS to the dst path under the specified BOS bucket. By default, existing target files are skipped
$ hadoop distcp hdfs://host:port/src bos://bucket/dst
Note: To perform CRC verification between the source and destination data, BOS HDFS must set fs.bos.block.size to match the source HDFS block size and enable fs.bos.crc32c.checksum.enable. Only the dfs.checksum.combine.mode=COMPOSITE_CRC checksum algorithm of HDFS is supported.
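A sketch of such a verified copy; it assumes the source cluster uses the default 128 MB block size and a Hadoop release that supports COMPOSITE_CRC (3.1 or later):

# Placeholder paths; 134217728 bytes = 128 MB, adjust fs.bos.block.size to match the source cluster's block size
$ hadoop distcp -D fs.bos.block.size=134217728 -D fs.bos.crc32c.checksum.enable=true -D dfs.checksum.combine.mode=COMPOSITE_CRC hdfs://host:port/src bos://bucket/dst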
Update and overwrite
# Copy with -update: the only criterion for overwriting is whether the source and target file sizes differ; if they do, the source file replaces the target file
$ hadoop distcp -update hdfs://host:port/src bos://bucket/dst
# Copy and overwrite existing target files
$ hadoop distcp -overwrite hdfs://host:port/src bos://bucket/dst
Copy multiple sources
# Specify multiple source paths
$ hadoop distcp hdfs://host:port/src1 hdfs://host:port/src2 bos://bucket/dst
# Obtain multiple sources from a file
# The content of srcList is one source path per line, e.g. hdfs://host:port/src1 and hdfs://host:port/src2
$ hadoop distcp -f hdfs://host:port/srcList bos://bucket/dst
Note: Copying data from multiple sources may lead to conflicts; refer to the conflict handling guidelines of standard Hadoop DistCp.
More configuration
# Specify the number of maps used for the copy
# Increasing the number of maps does not necessarily improve throughput and can cause problems; set it according to the cluster's resources and the amount of data to copy
$ hadoop distcp -m 10 hdfs://host:port/src bos://bucket/dst
# Ignore failed maps but keep the logs of the failed operations
$ hadoop distcp -i hdfs://host:port/src bos://bucket/dst
# Dynamic task allocation
# The default allocation strategy is based on file size. During an update copy, some files are skipped, which skews the work actually done across maps, and the slowest map drags down overall progress
$ hadoop distcp -strategy dynamic -update hdfs://host:port/src bos://bucket/dst
For additional configuration parameters, run the Hadoop DistCp command without arguments to view its help information.
Advanced usage
Self-built Hadoop clusters scale poorly and take considerable manpower to maintain. If you have higher requirements for performance and security, it is recommended to use Baidu MapReduce (BMR) provided by Baidu AI Cloud. BMR is a fully managed Hadoop/Spark cluster that can be deployed and elastically scaled on demand, so you only need to focus on big data processing, analysis, and reporting. Baidu's O&M team, with years of experience in large-scale distributed computing, fully manages cluster operation and maintenance, significantly improving performance, security, and convenience.
