fsspec-bosfs
Overview
fsspec-bosfs (referred to as bosfs below) is a Python file system implementation, built on the fsspec API, for seamless access to Baidu AI Cloud Object Storage (BOS). bosfs lets users manage data stored on BOS through the standard fsspec API.
Install
Install bosfs using pip:
$ pip install bosfs
Check that the installation succeeded:
$ pip show bosfs
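You can also confirm that the package imports cleanly from Python; a minimal sanity check that needs no BOS credentials:

```python
# Minimal sanity check: the installed package imports and exposes BOSFileSystem,
# the class used throughout this document. No BOS credentials are required here.
import bosfs

print(bosfs.BOSFileSystem)
```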
Quick start
Configure access credentials
There are two methods, with the following priority:
- Specify credentials via parameters when initializing BOSFileSystem:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='{your ak}', secret_key='{your sk}', sts_token=None)
```

- Configure environment variables:

```shell
export BCE_ACCESS_KEY_ID=xxx
export BCE_SECRET_ACCESS_KEY=xxx
export BOS_ENDPOINT=xxx
```
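With the environment variables above exported, the filesystem can be constructed without passing credentials explicitly; a minimal sketch, assuming the variables are set in the current environment (explicit parameters take priority over them, per the ordering above):

```python
import bosfs

# Assumption: credentials and endpoint are picked up from BCE_ACCESS_KEY_ID,
# BCE_SECRET_ACCESS_KEY and BOS_ENDPOINT, the environment-variable method
# described above.
fs = bosfs.BOSFileSystem()
print(fs.ls('/mybucket/'))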
Example: Basic read/write
- List data on BOS using bosfs:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
res = fs.ls('/mybucket/')
print(res)
```
- Read data on BOS using bosfs:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
with fs.open('/mybucket/README.md') as f:
    print(f.readline())
```
- Write data to BOS using bosfs:

```python
import bosfs
import os

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
object_name = "/mybucket/file1"
data = os.urandom(10 * 2**20)  # 10 MB of random bytes
with fs.open(object_name, "wb") as f_wb:
    f_wb.write(data)
```
For more usage, refer to the fsspec documentation.
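Because BOSFileSystem implements fsspec's standard AbstractFileSystem interface, the usual fsspec methods are available as well; a brief sketch with illustrative paths:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')

# Standard fsspec methods, shown with illustrative paths
print(fs.exists('/mybucket/README.md'))          # check whether an object exists
print(fs.info('/mybucket/README.md'))            # object metadata (size, type, ...)
print(fs.cat('/mybucket/README.md'))             # read the whole object as bytes
fs.get('/mybucket/README.md', 'README.md')       # download to a local file
fs.put('README.md', '/mybucket/README_copy.md')  # upload a local file
```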
Example: Ray
Ray is an open-source distributed computing framework for Python that enables users to seamlessly distribute their programs. PyTorch/TensorFlow trainers built on Ray can use it for data preprocessing, distributed training on heterogeneous clusters, and parallel data loading, preprocessing, and training with Ray Data.
Through Ray, BOS data can be accessed and processed directly via the fsspec API.
- Generate two CSV files and upload them to BOS (an upload sketch follows the data below).

The first file:

```csv
Name,Age,Gender,Grade
Alice,14,F,88
Bob,15,M,92
Charlie,14,M,85
David,15,M,90
Eva,14,F,95
```

The second file:

```csv
Name,Age,Gender,Grade
Fiona,15,F,89
George,14,M,87
Hannah,15,F,91
Ian,14,M,84
Julia,15,F,93
```
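A hedged sketch of the upload step, assuming the two files were saved locally as students_1.csv and students_2.csv (hypothetical names) and should land under the csv_dir/ prefix read by the Ray example below:

```python
import bosfs

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
# Upload the two local CSV files (hypothetical local names) into the
# csv_dir/ prefix of mybucket, which the Ray example below reads from.
for name in ["students_1.csv", "students_2.csv"]:
    fs.put(name, f"/mybucket/csv_dir/{name}")
```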
- Use ray.data to read the CSV files:

```python
import bosfs
import ray

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
# Read the CSV files under the csv_dir/ directory of mybucket
ds = ray.data.read_csv(paths="bos://mybucket/csv_dir/", filesystem=fs)
# Print the dataset's schema
print(ds.schema())
# Print the number of records in the dataset
print(ds.count())
# Print the first 3 records
ds.show(3)
```
- Output:

```
2025-04-03 21:43:28,551 INFO worker.py:1752 -- Started a local Ray instance.
Column  Type
------  ----
Name    string
Age     int64
Gender  string
Grade   int64
10
{'Name': 'Alice', 'Age': 14, 'Gender': 'F', 'Grade': 88}
{'Name': 'Bob', 'Age': 15, 'Gender': 'M', 'Grade': 92}
{'Name': 'Charlie', 'Age': 14, 'Gender': 'M', 'Grade': 85}
```
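Beyond reading, the BOS-backed dataset can be processed directly with Ray Data; a minimal sketch continuing from the example above (the derived Passed column and the 90-point threshold are illustrative, not part of the original example):

```python
# Continue from the `ds` read above: derive a new column with map_batches.
# By default, Ray Data passes each batch as a dict of NumPy arrays.
def add_passed_column(batch):
    batch["Passed"] = batch["Grade"] >= 90
    return batch

processed = ds.map_batches(add_passed_column)
processed.show(3)
```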
Example: PyArrow
PyArrow is the Python implementation of Apache Arrow, a high-performance in-memory columnar data format that bridges applications and data and enables efficient data exchange across different data processing systems. Because PyArrow can work with fsspec-based storage APIs, it can interface with a variety of storage backends through a unified API.
Additionally, PyArrow can easily access BOS through bosfs.
Example 1:
```python
import bosfs
import pyarrow.dataset as ads

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
# Read the student.parquet file from mybucket
ds = ads.dataset("bos://mybucket/student.parquet", filesystem=fs)
# Print the dataset's schema
print(ds.schema)
# Print the number of records in the dataset
print(ds.count_rows())
# Print the first record
print(ds.take([0]))
```
Example 2:
```python
import bosfs
from pyarrow.fs import FSSpecHandler, PyFileSystem

fs = bosfs.BOSFileSystem(endpoint='http://bj.bcebos.com', access_key='xxx', secret_key='xxx')
# Wrap the fsspec filesystem as a PyArrow filesystem
py_fs = PyFileSystem(FSSpecHandler(fs))
# Read BOS data via the PyArrow filesystem API
with py_fs.open_input_file("bos://mybucket/student.parquet") as f:
    f.read()
```
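The same wrapped filesystem can also be used to write data back to BOS through PyArrow, for example with pyarrow.parquet; a minimal sketch (the table contents and output path are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Continue from `py_fs` above: write a small table back to BOS (illustrative path).
table = pa.table({"Name": ["Alice", "Bob"], "Grade": [88, 92]})
pq.write_table(table, "mybucket/student_copy.parquet", filesystem=py_fs)
```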
Read/write performance
Test environment
- Test machine: 16 cores, 32GB memory, 3Gbps network bandwidth
Performance test results
Single-file sequential read/write performance
| File size (MB) | Write speed (MB/s) | Read speed (MB/s) | Write time (s) | Read time (s) |
|---|---|---|---|---|
| 1 | 9.672178 | 4.640667 | 0.104893 | 0.216561 |
| 4 | 24.867180 | 15.987999 | 0.160897 | 0.250713 |
| 10 | 30.681564 | 30.890233 | 0.343196 | 0.325457 |
| 128 | 75.634190 | 93.547005 | 1.692568 | 1.369047 |
| 512 | 80.359467 | 97.840121 | 6.371391 | 5.233074 |
Multi-file concurrent read/write performance
| Number of files | File size (MB) | Threads | Write speed (MB/s) | Read speed (MB/s) | Write time (s) | Read time (s) |
|---|---|---|---|---|---|---|
| 1000 | 4 | 16 | 381.91 | 301.56 | 10.47 | 13.26 |
Usage restrictions
- Currently, bosfs does not support async mode
