Configuration and Usage
To use the BOS HDFS tool, simply download the corresponding SDK package and adjust a few configurations.
Download
- Download the BOS FS JAR and copy the unzipped JAR package to $hadoop_dir/share/hadoop/common.
Preparation before use
- Modify log4j.properties in the Hadoop configuration path to adjust the log configuration of the BOS SDK: log4j.logger.com.baidubce.http=WARN
- Add or modify the BOS HDFS-related configurations in the $hadoop_dir/etc/hadoop/core-site.xml file.
<property>
    <name>fs.bos.access.key</name>
    <value>{Your AK}</value>
</property>
<property>
    <name>fs.bos.secret.access.key</name>
    <value>{Your SK}</value>
</property>
<property>
    <name>fs.bos.endpoint</name>
    <value>http://bj.bcebos.com</value>
</property>
<property>
    <name>fs.bos.impl</name>
    <value>org.apache.hadoop.fs.bos.BaiduBosFileSystem</value>
</property>
<property>
    <name>fs.AbstractFileSystem.bos.impl</name>
    <value>org.apache.hadoop.fs.bos.BOS</value>
</property>
Configurable attributes:
| Name | Description |
|---|---|
| fs.bos.access.key | Required: AccessKey for BOS. |
| fs.bos.secret.access.key | Required: SecretKey for BOS. |
| fs.bos.endpoint | Required: The endpoint where the BOS bucket is hosted. |
| fs.bos.session.token.key | Optional: Temporary access mode. If this is set, fs.bos.access.key and fs.bos.secret.access.key should also be temporary access keys. |
| fs.bos.credentials.provider | Optional: Credentials access mode. The available configurations are: org.apache.hadoop.fs.bos.credentials.BasicBOSCredentialsProvider and org.apache.hadoop.fs.bos.credentials.EnvironmentVariableCredentialsProvider. The default is org.apache.hadoop.fs.bos.credentials.BasicBOSCredentialsProvider. |
| fs.bos.bucket.{your bucket}.access.key | Optional: Specify the BOS accessKey for {your bucket}. This takes priority when accessing {your bucket}'s resources (see the example after this table). |
| fs.bos.bucket.{your bucket}.secret.access.key | Optional: Specify the BOS secretKey for {your bucket}. This will take priority when accessing {your bucket}'s resources. |
| fs.bos.bucket.{your bucket}.session.token.key | Optional: Specify the temporary STS access mode for {your bucket}. If this is set, fs.bos.bucket.{your bucket}.access.key and fs.bos.bucket.{your bucket}.secret.access.key should also be temporary access keys. This setting takes priority when accessing {your bucket}'s resources. |
| fs.bos.bucket.{your bucket}.endpoint | Optional: Specify the endpoint of the BOS bucket where {your bucket} is located. This will take priority when accessing {your bucket}'s resources. |
| fs.bos.block.size | Optional: Simulate HDFS Block Size, with a default value of 128 MB. This is useful in scenarios like DistCp that require verifying block information. |
| fs.bos.max.connections | Optional: Set the maximum number of client-supported connections, with a default value of 1000. |
| fs.bos.multipart.uploads.block.size | Optional: The part size used by the client for multipart uploads, 16,777,216 bytes by default. |
| fs.bos.multipart.uploads.concurrent.size | Optional: The number of concurrent multipart uploads supported by the client, with a default value of 10. |
| fs.bos.object.dir.showtime | Optional: Display the modification time of directories when listing. Enabling this will increase interactions. The default value is false. |
| fs.bos.tail.cache.enable | Optional: Cache part of the data at the file's end to improve the performance of reading files like parquet or orc. The default value is true. |
| fs.bos.crc32c.checksum.enable | Optional: Enable CRC32C checksum verification. Once enabled, the system will calculate CRC32C for uploaded files. |
| fs.bos.bucket.hierarchy | Optional: Whether the bucket is a namespace (hierarchical) bucket. A correct setting saves one verification request, while an incorrect one blocks access to the bucket. If left at the default, the value is queried automatically. |
| fs.bos.rename.enable | Optional: When enabled, metadata rename semantics are supported. This feature is currently in a small-scale beta; submit a ticket to confirm its availability, otherwise requests will fail with a 403 authorization error. The default value is false. |
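For example, a minimal sketch of overriding the credentials and endpoint for a single bucket with the fs.bos.bucket.{your bucket}.* properties above (all values here are placeholders):

<property>
    <name>fs.bos.bucket.{your bucket}.access.key</name>
    <value>{Bucket-specific AK}</value>
</property>
<property>
    <name>fs.bos.bucket.{your bucket}.secret.access.key</name>
    <value>{Bucket-specific SK}</value>
</property>
<property>
    <name>fs.bos.bucket.{your bucket}.endpoint</name>
    <value>{Endpoint of the bucket's region}</value>
</property>

These bucket-scoped settings take priority over the global fs.bos.access.key, fs.bos.secret.access.key, and fs.bos.endpoint when {your bucket} is accessed.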
If you use EnvironmentVariableCredentialsProvider, you need to set the following environment variables:
# Required: accessKey for BOS
FS_BOS_ACCESS_KEY
# Required: secretKey for BOS
FS_BOS_SECRET_ACCESS_KEY
# Optional: temporary access mode
FS_BOS_SESSION_TOKEN_KEY
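For example, a minimal sketch of exporting these variables (placeholder values) in a POSIX shell before running HDFS commands:

# Replace the placeholders with your own keys
$ export FS_BOS_ACCESS_KEY={Your AK}
$ export FS_BOS_SECRET_ACCESS_KEY={Your SK}
$ hdfs dfs -ls bos://{bucket}/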
Start using
When accessing BOS services using BOS HDFS, the path must start with bos://. For example:
$ hdfs dfs -ls bos://{bucket}/
$ hdfs dfs -put ${local_file} bos://{bucket}/a/b/c
Alternatively, configure the default file system in $hadoop_dir/etc/hadoop/core-site.xml:
<property>
    <name>fs.defaultFS</name>
    <value>bos://{bucket}</value>
</property>
Note: Configuring fs.defaultFS as BosFileSystem may cause a scheme check failure when starting NameNode and DataNode.
Recommendation: Configure fs.defaultFS this way only when you use BosFileSystem exclusively and do not need to start NameNode and DataNode. Otherwise, keep the default HDFS address, as shown below.
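If you do need NameNode and DataNode, a minimal sketch of keeping HDFS as the default file system (the NameNode host and port are placeholders for your own cluster):

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://{namenode_host}:{port}</value>
</property>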
With fs.defaultFS set to bos://{bucket}, you can then work with BOS just like native HDFS:
$ hdfs dfs -ls /
A wordcount example
1. Create data directories
Create a directory to store the input files of the MapReduce job:
$ hdfs dfs -mkdir -p bos://test-bucket/data/wordcount
Create a directory to store the output files of the MapReduce job:
$ hdfs dfs -mkdir bos://test-bucket/output
View the two newly created directories:
$ hdfs dfs -ls bos://test-bucket/
Found 2 items
drwxrwxrwx - 0 1970-01-01 08:00 bos://test-bucket/data
drwxrwxrwx - 0 1970-01-01 08:00 bos://test-bucket/output
If you want directory listings to show the actual modification time of folders, enable the following in $hadoop_dir/etc/hadoop/core-site.xml:
<property>
    <name>fs.bos.object.dir.showtime</name>
    <value>true</value>
</property>
2. Write a word file and upload it to HDFS
Content of the word file:
$ cat words.txt
hello baidu
hello bos
hadoop hdfs
hello hadoop
bos hdfs
Upload words.txt to HDFS:
$ hdfs dfs -put words.txt bos://test-bucket/data/wordcount
Check the uploaded file in HDFS:
$ hdfs dfs -cat bos://test-bucket/data/wordcount/words.txt
hello baidu
hello bos
hadoop hdfs
hello hadoop
bos hdfs
3. Run the wordcount program
The built-in wordcount program of Hadoop is located in $hadoop_dir/share/hadoop/mapreduce/.
$ hadoop jar hadoop-mapreduce-examples-2.7.7.jar wordcount bos://test-bucket/data/wordcount bos://test-bucket/output/wordcount
4. View statistical results
$ hdfs dfs -ls bos://test-bucket/output/wordcount/
-rw-rw-rw- 1 0 2020-06-12 16:55 bos://test-bucket/output/wordcount/_SUCCESS
-rw-rw-rw- 1 61 2020-06-12 16:55 bos://test-bucket/output/wordcount/part-r-00000
$ hdfs dfs -cat bos://test-bucket/output/wordcount/part-r-00000
baidu 1
bos 2
hadoop 2
hdfs 2
hello 3
Hadoop DistCp usage
DistCp (distributed copy) is a Hadoop tool for large-scale file copying between clusters or within a cluster. It uses Map/Reduce to handle file distribution, error management, recovery, and report generation. The tool takes a list of files and directories as input for map tasks, and each task processes part of the file list.
For copying data between HDFS clusters and BOS, you can use the BOS HDFS tool, similar to the standard Hadoop DistCp.
Preparations
As mentioned earlier, download the BOS HDFS-related jar packages and make the necessary configurations.
With an HDFS cluster as the source and BOS as the destination, verify that both the source and destination can be read from and written to properly:
$ hadoop fs -get hdfs://host:port/xxx ./
$ hadoop fs -put xxx bos://bucket/xxx
Start copying
Ordinary copy
# Copy the src path on HDFS to the dst path under the specified BOS bucket. By default, existing target files are skipped
$ hadoop distcp hdfs://host:port/src bos://bucket/dst
Note: To enable CRC verification for data before and after copying, ensure that BOS HDFS sets fs.bos.block.size to match the source HDFS and enables fs.bos.crc32c.checksum.enable. Currently, only HDFS's dfs.checksum.combine.mode=COMPOSITE_CRC verification algorithm is supported.
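A minimal configuration sketch for this CRC setup, assuming fs.bos.block.size is specified in bytes and the source HDFS uses the default 128 MB block size (adjust the value to match your cluster):

<!-- BOS HDFS side (core-site.xml): match the source HDFS block size and enable CRC32C -->
<property>
    <name>fs.bos.block.size</name>
    <value>134217728</value>
</property>
<property>
    <name>fs.bos.crc32c.checksum.enable</name>
    <value>true</value>
</property>
<!-- Source HDFS side (hdfs-site.xml): use the COMPOSITE_CRC combine mode -->
<property>
    <name>dfs.checksum.combine.mode</name>
    <value>COMPOSITE_CRC</value>
</property>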
Update and overwrite
# With -update, the only criterion for overwriting is whether the source and target file sizes differ; if they do, the source file replaces the target file
$ hadoop distcp -update hdfs://host:port/src bos://bucket/dst
# Copy and overwrite existing target files
$ hadoop distcp -overwrite hdfs://host:port/src bos://bucket/dst
Copy multiple sources
# Specify multiple source paths
$ hadoop distcp hdfs://host:port/src1 hdfs://host:port/src2 bos://bucket/dst
# Obtain multiple sources from a file passed with -f
# The srcList file contains one source path per line, e.g. hdfs://host:port/src1 and hdfs://host:port/src2
$ hadoop distcp -f hdfs://host:port/srcList bos://bucket/dst
Note: When copying data from multiple sources, conflicts may arise; refer to the conflict resolution methods for standard Hadoop DistCp.
More configuration
# Specify the number of maps used for copying
# Increasing the number of maps may not improve throughput and can cause problems; set it according to cluster resources and the amount of data to copy
$ hadoop distcp -m 10 hdfs://host:port/src bos://bucket/dst
# Ignore failed maps but keep the failure logs
$ hadoop distcp -i hdfs://host:port/src bos://bucket/dst
# Dynamic task allocation
# The default allocation strategy is based on file size. During an update copy, skipped files cause "data skew" in the actual copy work, and the slowest map drags down overall progress
$ hadoop distcp -strategy dynamic -update hdfs://host:port/src bos://bucket/dst
For more configuration parameters, run the hadoop distcp command to print its help information.
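For example, running DistCp with no arguments should print its usage text:

$ hadoop distcp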
Advanced usage
Self-built Hadoop clusters have limited scalability and require significant manpower to maintain. If you have higher requirements for performance and security, consider Baidu MapReduce (BMR) from Baidu AI Cloud: a fully managed Hadoop/Spark cluster that can be deployed and elastically scaled on demand. You only need to focus on big data processing, analysis, and reporting, while Baidu's O&M team, drawing on years of experience with large-scale distributed computing, fully manages cluster operation and maintenance, significantly improving performance, security, and convenience.
