Configuration and Usage
To use the BOS HDFS tool, simply download the corresponding SDK package and adjust a few configurations.
Download
- Download the BOS FS JAR and copy the unzipped JAR package to $hadoop_dir/share/hadoop/common.
Preparation before use
- Modify log4j.properties in the Hadoop configuration path to adjust the log configuration of the BOS SDK: log4j.logger.com.baidubce.http=WARN
- Add or modify the BOS HDFS-related configurations in the $hadoop_dir/etc/hadoop/core-site.xml file.
<property>
    <name>fs.bos.access.key</name>
    <value>{Your AK}</value>
</property>
<property>
    <name>fs.bos.secret.access.key</name>
    <value>{Your SK}</value>
</property>
<property>
    <name>fs.bos.endpoint</name>
    <value>http://bj.bcebos.com</value>
</property>
<property>
    <name>fs.bos.impl</name>
    <value>org.apache.hadoop.fs.bos.BaiduBosFileSystem</value>
</property>
<property>
    <name>fs.AbstractFileSystem.bos.impl</name>
    <value>org.apache.hadoop.fs.bos.BOS</value>
</property>
Configurable attributes:
| Name | Description |
|---|---|
| fs.bos.access.key | Required: AccessKey for BOS. |
| fs.bos.secret.access.key | Required: SecretKey for BOS. |
| fs.bos.endpoint | Required: The endpoint where the BOS bucket is hosted. |
| fs.bos.session.token.key | Optional: Temporary access mode. If this is set, fs.bos.access.key and fs.bos.secret.access.key should also be temporary access keys. |
| fs.bos.credentials.provider | Optional: Credentials access mode. The available configurations are: org.apache.hadoop.fs.bos.credentials.BasicBOSCredentialsProvider and org.apache.hadoop.fs.bos.credentials.EnvironmentVariableCredentialsProvider. The default is org.apache.hadoop.fs.bos.credentials.BasicBOSCredentialsProvider. |
| fs.bos.bucket.{your bucket}.access.key | Optional: Specify the BOS accessKey for {your bucket}. This takes priority when accessing {your bucket}'s resources (see the example after this table). |
| fs.bos.bucket.{your bucket}.secret.access.key | Optional: Specify the BOS secretKey for {your bucket}. This will take priority when accessing {your bucket}'s resources. |
| fs.bos.bucket.{your bucket}.session.token.key | Optional: Specify the temporary STS access mode for {your bucket}. If this is set, fs.bos.bucket.{your bucket}.access.key and fs.bos.bucket.{your bucket}.secret.access.key should also be temporary access keys. This setting takes priority when accessing {your bucket}'s resources. |
| fs.bos.bucket.{your bucket}.endpoint | Optional: Specify the endpoint of the BOS bucket where {your bucket} is located. This will take priority when accessing {your bucket}'s resources. |
| fs.bos.block.size | Optional: Simulate HDFS Block Size, with a default value of 128 MB. This is useful in scenarios like DistCp that require verifying block information. |
| fs.bos.max.connections | Optional: Set the maximum number of client-supported connections, with a default value of 1000. |
| fs.bos.multipart.uploads.block.size | Optional: The part size used by the client for multipart uploads, 16,777,216 bytes by default. |
| fs.bos.multipart.uploads.concurrent.size | Optional: The number of concurrent multipart uploads supported by the client, with a default value of 10. |
| fs.bos.object.dir.showtime | Optional: Display the modification time of directories when listing. Enabling this will increase interactions. The default value is false. |
| fs.bos.tail.cache.enable | Optional: Cache part of the data at the file's end to improve the performance of reading files like parquet or orc. The default value is true. |
| fs.bos.crc32c.checksum.enable | Optional: Enable CRC32C checksum verification. Once enabled, the system will calculate CRC32C for uploaded files. |
| fs.bos.bucket.hierarchy | Optional: Whether the bucket is a namespace (hierarchical) bucket. A correct setting saves one verification request, while an incorrect one blocks access to the bucket. If left at the default, the value is queried automatically. |
| fs.bos.rename.enable | Optional: When enabled, metadata rename semantics are supported. This feature is currently in a small-scale beta; submit a ticket to confirm its availability, otherwise requests will fail with a 403 authorization error. The default value is false. |
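For example, a minimal sketch of overriding the credentials and endpoint for a single bucket with the fs.bos.bucket.{your bucket}.* properties above (all values here are placeholders):

<property>
    <name>fs.bos.bucket.{your bucket}.access.key</name>
    <value>{Bucket-specific AK}</value>
</property>
<property>
    <name>fs.bos.bucket.{your bucket}.secret.access.key</name>
    <value>{Bucket-specific SK}</value>
</property>
<property>
    <name>fs.bos.bucket.{your bucket}.endpoint</name>
    <value>{Endpoint of the bucket's region}</value>
</property>

These bucket-scoped settings take priority over the global fs.bos.access.key, fs.bos.secret.access.key, and fs.bos.endpoint when {your bucket} is accessed.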
If you use EnvironmentVariableCredentialsProvider, you need to set the following environment variables:
# Required: accessKey for BOS
FS_BOS_ACCESS_KEY
# Required: secretKey for BOS
FS_BOS_SECRET_ACCESS_KEY
# Optional: temporary access mode
FS_BOS_SESSION_TOKEN_KEY
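For example, a minimal sketch of exporting these variables (placeholder values) in a POSIX shell before running HDFS commands:

# Replace the placeholders with your own keys
$ export FS_BOS_ACCESS_KEY={Your AK}
$ export FS_BOS_SECRET_ACCESS_KEY={Your SK}
$ hdfs dfs -ls bos://{bucket}/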
Start using
When accessing BOS services using BOS HDFS, the path must start with bos://. For example:
$ hdfs dfs -ls bos://{bucket}/
$ hdfs dfs -put ${local_file} bos://{bucket}/a/b/c
Alternatively, configure the default file system in $hadoop_dir/etc/hadoop/core-site.xml:
<property>
    <name>fs.defaultFS</name>
    <value>bos://{bucket}</value>
</property>
Note: Configuring fs.defaultFS as BosFileSystem may cause a scheme check failure when starting NameNode and DataNode.
Recommendation: Configure fs.defaultFS this way only when you use BosFileSystem exclusively and do not need to start NameNode and DataNode. Otherwise, keep the default HDFS address, as shown below.
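If you do need NameNode and DataNode, a minimal sketch of keeping HDFS as the default file system (the NameNode host and port are placeholders for your own cluster):

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://{namenode_host}:{port}</value>
</property>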
With fs.defaultFS set to bos://{bucket}, you can then work with BOS just like native HDFS:
$ hdfs dfs -ls /
A wordcount example
1. Create data directories
Create a directory to store the input files of the MapReduce job:
$ hdfs dfs -mkdir -p bos://test-bucket/data/wordcount
Create a directory to store the output files of the MapReduce job:
$ hdfs dfs -mkdir bos://test-bucket/output
View the two newly created directories:
$ hdfs dfs -ls bos://test-bucket/
Found 2 items
drwxrwxrwx - 0 1970-01-01 08:00 bos://test-bucket/data
drwxrwxrwx - 0 1970-01-01 08:00 bos://test-bucket/output
If you want directory listings to show the actual modification time of folders, enable the following in $hadoop_dir/etc/hadoop/core-site.xml:
<property>
    <name>fs.bos.object.dir.showtime</name>
    <value>true</value>
</property>
2. Write a word file and upload it to HDFS
Content of the word file:
$ cat words.txt
hello baidu
hello bos
hadoop hdfs
hello hadoop
bos hdfs
Upload words.txt to HDFS:
$ hdfs dfs -put words.txt bos://test-bucket/data/wordcount
Check the uploaded file in HDFS:
$ hdfs dfs -cat bos://test-bucket/data/wordcount/words.txt
hello baidu
hello bos
hadoop hdfs
hello hadoop
bos hdfs
3. Run the wordcount program
The built-in wordcount program of Hadoop is located in $hadoop_dir/share/hadoop/mapreduce/.
$ hadoop jar hadoop-mapreduce-examples-2.7.7.jar wordcount bos://test-bucket/data/wordcount bos://test-bucket/output/wordcount
4. View statistical results
$ hdfs dfs -ls bos://test-bucket/output/wordcount/
-rw-rw-rw- 1 0 2020-06-12 16:55 bos://test-bucket/output/wordcount/_SUCCESS
-rw-rw-rw- 1 61 2020-06-12 16:55 bos://test-bucket/output/wordcount/part-r-00000
$ hdfs dfs -cat bos://test-bucket/output/wordcount/part-r-00000
baidu 1
bos 2
hadoop 2
hdfs 2
hello 3
Hadoop DistCp usage
DistCp (distributed copy) is a Hadoop tool for large-scale file copying between clusters or within a cluster. It uses Map/Reduce to handle file distribution, error management, recovery, and report generation. The tool takes a list of files and directories as input for map tasks, and each task processes part of the file list.
For copying data between HDFS clusters and BOS, you can use the BOS HDFS tool, similar to the standard Hadoop DistCp.
Preparations
As mentioned earlier, download the BOS HDFS-related jar packages and make the necessary configurations.
With an HDFS cluster as the source and BOS as the destination, verify that both the source and destination can be read from and written to properly:
$ hadoop fs -get hdfs://host:port/xxx ./
$ hadoop fs -put xxx bos://bucket/xxx
Start copying
Ordinary copy
# Copy the src path on HDFS to the dst path under the specified BOS bucket. By default, existing target files are skipped
$ hadoop distcp hdfs://host:port/src bos://bucket/dst
Note: To enable CRC verification for data before and after copying, ensure that BOS HDFS sets fs.bos.block.size to match the source HDFS and enables fs.bos.crc32c.checksum.enable. Currently, only HDFS's dfs.checksum.combine.mode=COMPOSITE_CRC verification algorithm is supported.
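A minimal configuration sketch for this CRC setup, assuming fs.bos.block.size is specified in bytes and the source HDFS uses the default 128 MB block size (adjust the value to match your cluster):

<!-- BOS HDFS side (core-site.xml): match the source HDFS block size and enable CRC32C -->
<property>
    <name>fs.bos.block.size</name>
    <value>134217728</value>
</property>
<property>
    <name>fs.bos.crc32c.checksum.enable</name>
    <value>true</value>
</property>
<!-- Source HDFS side (hdfs-site.xml): use the COMPOSITE_CRC combine mode -->
<property>
    <name>dfs.checksum.combine.mode</name>
    <value>COMPOSITE_CRC</value>
</property>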
Update and overwrite
# With -update, the only criterion for overwriting is whether the source and target file sizes differ; if they do, the source file replaces the target file
$ hadoop distcp -update hdfs://host:port/src bos://bucket/dst
# Copy and overwrite existing target files
$ hadoop distcp -overwrite hdfs://host:port/src bos://bucket/dst
Copy multiple sources
# Specify multiple source paths
$ hadoop distcp hdfs://host:port/src1 hdfs://host:port/src2 bos://bucket/dst
# Obtain multiple sources from a file passed with -f
# The srcList file contains one source path per line, e.g. hdfs://host:port/src1 and hdfs://host:port/src2
$ hadoop distcp -f hdfs://host:port/srcList bos://bucket/dst
Note: When copying data from multiple sources, conflicts may arise; refer to the conflict resolution methods for standard Hadoop DistCp.
More configuration
# Specify the number of maps used for copying
# Increasing the number of maps may not improve throughput and can cause problems; set it according to cluster resources and the amount of data to copy
$ hadoop distcp -m 10 hdfs://host:port/src bos://bucket/dst
# Ignore failed maps but keep the failure logs
$ hadoop distcp -i hdfs://host:port/src bos://bucket/dst
# Dynamic task allocation
# The default allocation strategy is based on file size. During an update copy, skipped files cause "data skew" in the actual copy work, and the slowest map drags down overall progress
$ hadoop distcp -strategy dynamic -update hdfs://host:port/src bos://bucket/dst
For more configuration parameters, run the hadoop distcp command to print its help information.
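For example, running DistCp with no arguments should print its usage text:

$ hadoop distcp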
Advanced usage
Self-built Hadoop clusters have limited scalability and require significant manpower to maintain. If you have higher requirements for performance and security, consider Baidu MapReduce (BMR) from Baidu AI Cloud: a fully managed Hadoop/Spark cluster that can be deployed and elastically scaled on demand. You only need to focus on big data processing, analysis, and reporting, while Baidu's O&M team, drawing on years of experience with large-scale distributed computing, fully manages cluster operation and maintenance, significantly improving performance, security, and convenience.
