Configuration and Usage
To use the BOS HDFS tool, you only need to download the corresponding SDK package and modify a few configurations.
Download
- Download the BOS FS JAR and copy the unzipped JAR package to $hadoop_dir/share/hadoop/common.
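For example, a minimal sketch of these steps (the archive and JAR names are assumptions and depend on the release you download):

# Unzip the downloaded package and copy the JAR into Hadoop's common library directory
$ unzip bos-hdfs-sdk.zip -d bos-hdfs-sdk
$ cp bos-hdfs-sdk/*.jar $hadoop_dir/share/hadoop/common/
# Optionally confirm that the JAR is visible on the Hadoop classpath
$ hadoop classpath --glob | tr ':' '\n' | grep -i bos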
Preparation before use
- Modify log4j.properties in the Hadoop configuration path to adjust the log configuration of the BOS SDK: log4j.logger.com.baidubce.http=WARN
- Add or modify the BOS HDFS-related configurations in the $hadoop_dir/etc/hadoop/core-site.xml file.
<property>
    <name>fs.bos.access.key</name>
    <value>{Your AK}</value>
</property>
<property>
    <name>fs.bos.secret.access.key</name>
    <value>{Your SK}</value>
</property>
<property>
    <name>fs.bos.endpoint</name>
    <value>http://bj.bcebos.com</value>
</property>
<property>
    <name>fs.bos.impl</name>
    <value>org.apache.hadoop.fs.bos.BaiduBosFileSystem</value>
</property>
<property>
    <name>fs.AbstractFileSystem.bos.impl</name>
    <value>org.apache.hadoop.fs.bos.BOS</value>
</property>
Configurable attributes:
| Name | Description |
|---|---|
| fs.bos.access.key | Required: AccessKey for BOS. |
| fs.bos.secret.access.key | Required: SecretKey for BOS. |
| fs.bos.endpoint | Required: The endpoint where the BOS bucket is hosted. |
| fs.bos.session.token.key | Optional: Temporary access mode. If configured, fs.bos.access.key and fs.bos.secret.access.key should also be temporary access keys. |
| fs.bos.credentials.provider | Optional: Credentials access mode. Configurable values include: org.apache.hadoop.fs.bos.credentials.BasicBOSCredentialsProvider, org.apache.hadoop.fs.bos.credentials.EnvironmentVariableCredentialsProvider. The default value is org.apache.hadoop.fs.bos.credentials.BasicBOSCredentialsProvider. |
| fs.bos.bucket.{your bucket}.access.key | Optional: Specify the BOS AccessKey for {your bucket}. This setting will take priority when accessing {your bucket} resources. |
| fs.bos.bucket.{your bucket}.secret.access.key | Optional: Specify the BOS SecretKey for {your bucket}. This setting will take priority when accessing {your bucket} resources. |
| fs.bos.bucket.{your bucket}.session.token.key | Optional: Specify the temporary STS access mode for {your bucket}. If configured, fs.bos.bucket.{your bucket}.access.key and fs.bos.bucket.{your bucket}.secret.access.key should also be temporary access keys. This setting will take priority when accessing {your bucket} resources. |
| fs.bos.bucket.{your bucket}.endpoint | Optional: Specify the endpoint of the BOS bucket where {your bucket} is located. This setting will take priority when accessing {your bucket} resources. |
| fs.bos.block.size | Optional: Simulate HDFS block size, with a default of 128 MB; this is useful for scenarios like DistCp that require block information validation. |
| fs.bos.max.connections | Optional: Set the maximum number of client-supported connections, with a default value of 1000. |
| fs.bos.multipart.uploads.block.size | Optional: Part size used by the client for multipart uploads; the default is 16,777,216 bytes (16 MB). |
| fs.bos.multipart.uploads.concurrent.size | Optional: The number of concurrent multipart uploads supported by the client, with a default value of 10. |
| fs.bos.object.dir.showtime | Optional: Display the modification time of directories while listing. Enabling this increases the number of interactions; the default is false. |
| fs.bos.tail.cache.enable | Optional: Cache part of the data at the end of files to enhance resolution performance for file formats like Parquet/ORC; the default is true. |
| fs.bos.crc32c.checksum.enable | Optional: Enable CRC32C checksum verification. When enabled, CRC32C will be computed for uploaded files. |
| fs.bos.bucket.hierarchy | Optional: Whether the bucket is a hierarchical-namespace bucket. A correct setting saves a verification step, while an incorrect one causes bucket access failures. If left at the default, this is determined automatically. |
| fs.bos.rename.enable | Optional: Enable metadata rename semantics. This feature is currently in a small-scale closed beta; you must submit a ticket to confirm usage, otherwise requests will return a 403 authorization failure. The default is false. |
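For example, the per-bucket overrides can be added to core-site.xml in the same way as the global settings, or passed on the command line through Hadoop's generic -D options. In the sketch below, my-bucket, its AK/SK, and the gz.bcebos.com endpoint are placeholder values:

# Illustrative only: my-bucket, its AK/SK, and the endpoint are placeholders
$ hdfs dfs -D fs.bos.bucket.my-bucket.access.key={AK for my-bucket} -D fs.bos.bucket.my-bucket.secret.access.key={SK for my-bucket} -D fs.bos.bucket.my-bucket.endpoint=http://gz.bcebos.com -ls bos://my-bucket/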
If using EnvironmentVariableCredentialsProvider, you need to set environment variables:
# Required: AccessKey for BOS
FS_BOS_ACCESS_KEY
# Required: SecretKey for BOS
FS_BOS_SECRET_ACCESS_KEY
# Optional: temporary access mode (STS session token)
FS_BOS_SESSION_TOKEN_KEY
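A minimal sketch of this mode (the values are placeholders); it assumes fs.bos.credentials.provider is set to org.apache.hadoop.fs.bos.credentials.EnvironmentVariableCredentialsProvider in core-site.xml:

# Placeholder values; export the credentials in the environment that runs the Hadoop commands
$ export FS_BOS_ACCESS_KEY={Your AK}
$ export FS_BOS_SECRET_ACCESS_KEY={Your SK}
# Only needed when using temporary (STS) credentials
$ export FS_BOS_SESSION_TOKEN_KEY={Your session token}
$ hdfs dfs -ls bos://{bucket}/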
Start using
When accessing BOS services using BOS HDFS, the path must start with bos://. For example:
$ hdfs dfs -ls bos://{bucket}/
$ hdfs dfs -put ${local_file} bos://{bucket}/a/b/c
Alternatively, configure the default file system in $hadoop_dir/etc/hadoop/core-site.xml:
<property>
    <name>fs.defaultFS</name>
    <value>bos://{bucket}</value>
</property>
Note: When fs.defaultFS is set to the BOS file system, starting the NameNode and DataNode may fail with a scheme check error.
It is therefore advised to configure fs.defaultFS this way only if you use BosFileSystem and do not need to start the NameNode and DataNode; otherwise, keep fs.defaultFS pointing to the default HDFS address.
Just like using native HDFS:
$ hdfs dfs -ls /
A WordCount example
1. Create the data directories
Create a directory to store the input files of the MapReduce job:
$ hdfs dfs -mkdir -p bos://test-bucket/data/wordcount
Create a directory to store the output files of the MapReduce job:
$ hdfs dfs -mkdir bos://test-bucket/output
View the two newly created directories:
$ hdfs dfs -ls bos://test-bucket/
Found 2 items
drwxrwxrwx - 0 1970-01-01 08:00 bos://test-bucket/data
drwxrwxrwx - 0 1970-01-01 08:00 bos://test-bucket/output
If you want directory listings to display the actual modification time, configure the following in $hadoop_dir/etc/hadoop/core-site.xml:
<property>
    <name>fs.bos.object.dir.showtime</name>
    <value>true</value>
</property>
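After this change takes effect, listing the bucket again should show the directories' real modification times rather than the 1970-01-01 placeholder:

$ hdfs dfs -ls bos://test-bucket/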
2. Write a word file and upload it to HDFS
Content of the word file:
$ cat words.txt
hello baidu
hello bos
hadoop hdfs
hello hadoop
bos hdfs
Upload words.txt to HDFS:
$ hdfs dfs -put words.txt bos://test-bucket/data/wordcount
Check the uploaded file in HDFS:
$ hdfs dfs -cat bos://test-bucket/data/wordcount/words.txt
hello baidu
hello bos
hadoop hdfs
hello hadoop
bos hdfs
3. Run the wordcount program
The built-in wordcount program of Hadoop is located in $hadoop_dir/share/hadoop/mapreduce/.
$ hadoop jar hadoop-mapreduce-examples-2.7.7.jar wordcount bos://test-bucket/data/wordcount bos://test-bucket/output/wordcount
4. View the word count results
$ hdfs dfs -ls bos://test-bucket/output/wordcount/
-rw-rw-rw- 1 0 2020-06-12 16:55 bos://test-bucket/output/wordcount/_SUCCESS
-rw-rw-rw- 1 61 2020-06-12 16:55 bos://test-bucket/output/wordcount/part-r-00000
$ hdfs dfs -cat bos://test-bucket/output/wordcount/part-r-00000
baidu 1
bos 2
hadoop 2
hdfs 2
hello 3
Usage of Hadoop DistCp
DistCp (distributed copy) is a built-in Hadoop tool designed for large-scale data transfers between and within clusters. It uses Map/Reduce for file distribution, error handling, recovery, and report generation. DistCp processes a list of files and directories as input and assigns map tasks, with each task handling a portion of the files from the source list.
With the BOS HDFS tool, you can copy data between HDFS clusters and BOS just as you would with standard Hadoop DistCp.
Preparations
As explained above, download the BOS HDFS-related JAR files and make the necessary configurations.
Taking an HDFS cluster as the source and BOS as the destination as an example, first verify that both the source and destination can be read and written as expected:
$ hadoop fs -get hdfs://host:port/xxx ./
$ hadoop fs -put xxx bos://bucket/xxx
Start copying
Ordinary copy
# Copy the src path on HDFS to the dst path under the specified BOS bucket. By default, existing target files are skipped
$ hadoop distcp hdfs://host:port/src bos://bucket/dst
Note: To perform CRC verification between the source and destination data, BOS HDFS must set fs.bos.block.size to match the source HDFS block size and enable fs.bos.crc32c.checksum.enable. Only the dfs.checksum.combine.mode=COMPOSITE_CRC checksum algorithm of HDFS is supported.
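A sketch of such a verified copy; it assumes the source cluster uses the default 128 MB block size and a Hadoop release that supports COMPOSITE_CRC (3.1 or later):

# Placeholder paths; 134217728 bytes = 128 MB, adjust fs.bos.block.size to match the source cluster's block size
$ hadoop distcp -D fs.bos.block.size=134217728 -D fs.bos.crc32c.checksum.enable=true -D dfs.checksum.combine.mode=COMPOSITE_CRC hdfs://host:port/src bos://bucket/dst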
Update and overwrite
# Copy with -update: the only criterion for overwriting is whether the source and target file sizes differ; if they do, the source file replaces the target file
$ hadoop distcp -update hdfs://host:port/src bos://bucket/dst
# Copy and overwrite existing target files
$ hadoop distcp -overwrite hdfs://host:port/src bos://bucket/dst
Copy multiple sources
# Specify multiple source paths
$ hadoop distcp hdfs://host:port/src1 hdfs://host:port/src2 bos://bucket/dst
# Obtain multiple sources from a file
# The content of srcList is one source path per line, e.g. hdfs://host:port/src1 and hdfs://host:port/src2
$ hadoop distcp -f hdfs://host:port/srcList bos://bucket/dst
Note: Copying data from multiple sources may lead to conflicts; refer to the conflict handling guidelines of standard Hadoop DistCp.
More configuration
# Specify the number of maps used for the copy
# Increasing the number of maps does not necessarily improve throughput and can cause problems; set it according to the cluster's resources and the amount of data to copy
$ hadoop distcp -m 10 hdfs://host:port/src bos://bucket/dst
# Ignore failed maps but keep the logs of the failed operations
$ hadoop distcp -i hdfs://host:port/src bos://bucket/dst
# Dynamic task allocation
# The default allocation strategy is based on file size. During an update copy, some files are skipped, which skews the work actually done across maps, and the slowest map drags down overall progress
$ hadoop distcp -strategy dynamic -update hdfs://host:port/src bos://bucket/dst
For additional configuration parameters, run the Hadoop DistCp command without arguments to view its help information.
Advanced usage
Self-built Hadoop clusters scale poorly and take considerable manpower to maintain. If you have higher requirements for performance and security, it is recommended to use Baidu MapReduce (BMR) provided by Baidu AI Cloud. BMR is a fully managed Hadoop/Spark cluster that can be deployed and elastically scaled on demand, so you only need to focus on big data processing, analysis, and reporting. Baidu's O&M team, with years of experience in large-scale distributed computing, fully manages cluster operation and maintenance, significantly improving performance, security, and convenience.
