fsspec-bosfs
Overview
fsspec-bosfs (referred to as bosfs below) is a Python file system implementation, built on the fsspec API, for seamless access to Baidu AI Cloud Object Storage (BOS). bosfs lets users manage data stored on BOS through the standard fsspec API.
Install
Install bosfs using pip:
$ pip install bosfs
Check that the installation succeeded:
$ pip show bosfs
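You can also confirm that the package imports cleanly from Python; a minimal sanity check that needs no BOS credentials:

```python
# Minimal sanity check: the installed package imports and exposes BOSFileSystem,
# the class used throughout this document. No BOS credentials are required here.
import bosfs

print(bosfs.BOSFileSystem)
```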
Quick start
Configure access credentials
There are two methods, with the following priority:
- Specify credentials via parameters when initializing BOSFileSystem:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='{your ak}', secret_key='{your sk}', sts_token=None)
```

- Configure environment variables:

```shell
export BCE_ACCESS_KEY_ID=xxx
export BCE_SECRET_ACCESS_KEY=xxx
export BOS_ENDPOINT=xxx
```
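With the environment variables above exported, the filesystem can be constructed without passing credentials explicitly; a minimal sketch, assuming the variables are set in the current environment (explicit parameters take priority over them, per the ordering above):

```python
import bosfs

# Assumption: credentials and endpoint are picked up from BCE_ACCESS_KEY_ID,
# BCE_SECRET_ACCESS_KEY and BOS_ENDPOINT, the environment-variable method
# described above.
fs = bosfs.BOSFileSystem()
print(fs.ls('/mybucket/'))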
Example: Basic read/write
- List data on BOS using bosfs:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
res = fs.ls('/mybucket/')
print(res)
```
- Read data on BOS using bosfs:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
with fs.open('/mybucket/README.md') as f:
    print(f.readline())
```
- Write data to BOS using bosfs:

```python
import bosfs
import os

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
object_name = "/mybucket/file1"
data = os.urandom(10 * 2**20)  # 10 MB of random bytes
with fs.open(object_name, "wb") as f_wb:
    f_wb.write(data)
```
For more usage, refer to the fsspec documentation.
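Because BOSFileSystem implements fsspec's standard AbstractFileSystem interface, the usual fsspec methods are available as well; a brief sketch with illustrative paths:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')

# Standard fsspec methods, shown with illustrative paths
print(fs.exists('/mybucket/README.md'))          # check whether an object exists
print(fs.info('/mybucket/README.md'))            # object metadata (size, type, ...)
print(fs.cat('/mybucket/README.md'))             # read the whole object as bytes
fs.get('/mybucket/README.md', 'README.md')       # download to a local file
fs.put('README.md', '/mybucket/README_copy.md')  # upload a local file
```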
Example: Ray
Ray is an open-source distributed computing framework for Python that enables users to seamlessly distribute their programs. PyTorch/TensorFlow trainers built on Ray can use it for data preprocessing, distributed training on heterogeneous clusters, and parallel data loading, preprocessing, and training with Ray Data.
Through Ray, BOS data can be accessed and processed directly via the fsspec API.
- Generate two CSV files and upload them to BOS (an upload sketch follows the data below).

The first file:

```csv
Name,Age,Gender,Grade
Alice,14,F,88
Bob,15,M,92
Charlie,14,M,85
David,15,M,90
Eva,14,F,95
```

The second file:

```csv
Name,Age,Gender,Grade
Fiona,15,F,89
George,14,M,87
Hannah,15,F,91
Ian,14,M,84
Julia,15,F,93
```
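A hedged sketch of the upload step, assuming the two files were saved locally as students_1.csv and students_2.csv (hypothetical names) and should land under the csv_dir/ prefix read by the Ray example below:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
# Upload the two local CSV files (hypothetical local names) into the
# csv_dir/ prefix of mybucket, which the Ray example below reads from.
for name in ["students_1.csv", "students_2.csv"]:
    fs.put(name, f"/mybucket/csv_dir/{name}")
```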
- Use ray.data to read the CSV files:

```python
import bosfs
import ray

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
# Read the CSV files under the csv_dir/ directory of mybucket
ds = ray.data.read_csv(paths="bos://mybucket/csv_dir/", filesystem=fs)
# Print the dataset's schema
print(ds.schema())
# Print the number of records in the dataset
print(ds.count())
# Print the first 3 records
ds.show(3)
```
- Output:

```
2025-04-03 21:43:28,551 INFO worker.py:1752 -- Started a local Ray instance.
Column  Type
------  ----
Name    string
Age     int64
Gender  string
Grade   int64
10
{'Name': 'Alice', 'Age': 14, 'Gender': 'F', 'Grade': 88}
{'Name': 'Bob', 'Age': 15, 'Gender': 'M', 'Grade': 92}
{'Name': 'Charlie', 'Age': 14, 'Gender': 'M', 'Grade': 85}
```
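Beyond reading, the BOS-backed dataset can be processed directly with Ray Data; a minimal sketch continuing from the example above (the derived Passed column and the 90-point threshold are illustrative, not part of the original example):

```python
# Continue from the `ds` read above: derive a new column with map_batches.
# By default, Ray Data passes each batch as a dict of NumPy arrays.
def add_passed_column(batch):
    batch["Passed"] = batch["Grade"] >= 90
    return batch

processed = ds.map_batches(add_passed_column)
processed.show(3)
```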
Example: PyArrow
PyArrow is the Python implementation of Apache Arrow, a high-performance in-memory columnar data format that bridges applications and data and enables efficient data exchange across different data processing systems. Because PyArrow can work with fsspec-based storage APIs, it can interface with a variety of storage backends through a unified API.
Additionally, PyArrow can easily access BOS through bosfs.
Example 1:
```python
import bosfs
import pyarrow.dataset as ads

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
# Read the student.parquet file from mybucket
ds = ads.dataset("bos://mybucket/student.parquet", filesystem=fs)
# Print the dataset's schema
print(ds.schema)
# Print the number of records in the dataset
print(ds.count_rows())
# Print the first record
print(ds.take([0]))
```
Example 2:
```python
import bosfs
from pyarrow.fs import FSSpecHandler, PyFileSystem

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
# Wrap the fsspec filesystem as a PyArrow filesystem
py_fs = PyFileSystem(FSSpecHandler(fs))
# Read BOS data via the PyArrow filesystem API
with py_fs.open_input_file("bos://mybucket/student.parquet") as f:
    f.read()
```
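The same wrapped filesystem can also be used to write data back to BOS through PyArrow, for example with pyarrow.parquet; a minimal sketch (the table contents and output path are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Continue from `py_fs` above: write a small table back to BOS (illustrative path).
table = pa.table({"Name": ["Alice", "Bob"], "Grade": [88, 92]})
pq.write_table(table, "mybucket/student_copy.parquet", filesystem=py_fs)
```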
Read/write performance
Test environment
- Test machine: 16 cores, 32GB memory, 3Gbps network bandwidth
Performance test results
Single-file sequential read/write performance
| File size (MB) | Write speed (MB/s) | Read speed (MB/s) | Write time (s) | Read time (s) |
|---|---|---|---|---|
| 1 | 9.672178 | 4.640667 | 0.104893 | 0.216561 |
| 4 | 24.867180 | 15.987999 | 0.160897 | 0.250713 |
| 10 | 30.681564 | 30.890233 | 0.343196 | 0.325457 |
| 128 | 75.634190 | 93.547005 | 1.692568 | 1.369047 |
| 512 | 80.359467 | 97.840121 | 6.371391 | 5.233074 |
Multi-file concurrent read/write performance
| Number of files | File size (MB) | Threads | Write speed (MB/s) | Read speed (MB/s) | Write time (s) | Read time (s) |
|---|---|---|---|---|---|---|
| 1000 | 4 | 16 | 381.91 | 301.56 | 10.47 | 13.26 |
Usage restrictions
- Currently, bosfs does not support async mode
