Building a Dataset
Updated at: 2025-11-03
bostorchconnector currently supports two types of datasets:
- BosMapDataset: Inherits from torch.utils.data.Dataset
- BosIterableDataset: Inherits from torch.utils.data.IterableDataset
Both datasets support the following two construction methods:
- from_prefix: Constructed through a storage prefix
- from_objects: Constructed through a known list of objects
The bos_client_config parameter accepts the following configuration options when building a dataset:
| Parameter | Default value | Description |
|---|---|---|
| credentials_path | "~/.baidubce/credentials" | Path to the file storing the access credentials (AK/SK) |
| log_level | 1 | Log level: 0 = DEBUG, 1 = INFO (default), 2 = WARN, 3 = ERROR, 4 = FATAL |
| log_path | "/tmp/bostorchconnector/sdk.log" | Path of the log file |
| part_size | 8388608 | Part size for multipart upload, in bytes; the default is 8,388,608 bytes (8 MB) |
| prefect_limit_mb | 4096 | Prefetch cache limit, in MB; the default is 4,096 MB |
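For example, these options can be passed to BosClientConfig as keyword arguments. This is a minimal sketch: apart from log_level, which the examples below use, the keyword names are assumptions taken from the table and should be verified against the library's actual signature.
Python
from bostorchconnector import BosClientConfig

# Sketch only: keyword names mirror the table above; except for log_level they are
# assumptions and should be checked against BosClientConfig's real signature.
config = BosClientConfig(
    credentials_path="~/.baidubce/credentials",  # file holding the AK/SK
    log_level=2,                                 # 2 = WARN
    log_path="/tmp/bostorchconnector/sdk.log",   # where SDK logs are written
    part_size=8 * 1024 * 1024,                   # 8 MB multipart part size
)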
Constructing a BosMapDataset for random access
The BosMapDataset is ideal for small datasets that require random access.
Example using from_prefix:
Python
from bostorchconnector import BosMapDataset, BosClientConfig

# Fill in <BUCKET>, <PREFIX> and the corresponding endpoint
DATASET_URI = "bos://<BUCKET>/<PREFIX>"
ENDPOINT = "http://bj.bcebos.com"
config = BosClientConfig(log_level=1)
map_dataset = BosMapDataset.from_prefix(DATASET_URI, endpoint=ENDPOINT, bos_client_config=config)
# Randomly read the sample with index=0
item = map_dataset[0]
# Obtain information such as bucket, key, and data
bucket = item.bucket
key = item.key
content = item.read()
print(len(content))
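Because BosMapDataset is a standard map-style dataset, it can be wrapped in torch.utils.data.DataLoader. The sketch below reuses the map_dataset built above; collate_bytes is a hypothetical collate function that reads each sampled object's bytes.
Python
from torch.utils.data import DataLoader

# Hypothetical collate function: read each sampled object's bytes along with its key.
def collate_bytes(items):
    return [(item.key, item.read()) for item in items]

# Map-style datasets support shuffling and random sampling through the DataLoader.
loader = DataLoader(map_dataset, batch_size=4, shuffle=True, collate_fn=collate_bytes)
for batch in loader:
    for key, data in batch:
        print(key, len(data))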
Example using from_objects:
Python
from bostorchconnector import BosMapDataset, BosClientConfig

# Fill in <BUCKET>, the object names and the corresponding endpoint
DATASET_URIS = [
    "bos://<BUCKET>/img001.jpg",
    "bos://<BUCKET>/img002.jpg",
    "bos://<BUCKET>/img003.jpg"
]
ENDPOINT = "http://bj.bcebos.com"
config = BosClientConfig(log_level=1)
map_dataset = BosMapDataset.from_objects(DATASET_URIS, endpoint=ENDPOINT, bos_client_config=config)
# Randomly read the sample with index=0
item = map_dataset[0]
# Obtain information such as bucket, key, and data
bucket = item.bucket
key = item.key
content = item.read()
print(len(content))
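Since the objects in this example are JPEG images, each item's bytes can be decoded with an image library; a minimal sketch, assuming Pillow is installed:
Python
from io import BytesIO
from PIL import Image

# Decode the first object's bytes into a PIL image (assumes the objects are JPEGs
# and that Pillow is installed); adapt the decoding step to your data format.
img = Image.open(BytesIO(map_dataset[0].read()))
print(img.size)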
Constructing a BosIterableDataset for streaming sequential reads
The BosIterableDataset is suitable for handling large datasets that need sequential processing.
Example using from_prefix:
Python
from bostorchconnector import BosIterableDataset, BosClientConfig

# Fill in <BUCKET>, <PREFIX> and the corresponding endpoint
DATASET_URI = "bos://<BUCKET>/<PREFIX>"
ENDPOINT = "http://bj.bcebos.com"
config = BosClientConfig(log_level=1)
iterable_dataset = BosIterableDataset.from_prefix(DATASET_URI, endpoint=ENDPOINT, bos_client_config=config)
# BosIterableDataset is an iterable object
for item in iterable_dataset:
    data = item.read()
    print(len(data))
    print(item.key)
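A BosIterableDataset can likewise be fed to torch.utils.data.DataLoader. As an iterable-style dataset it does not support shuffle=True, and whether samples are sharded across multiple workers depends on the library, so this sketch keeps the default single-process loading. It reuses the iterable_dataset built above and a hypothetical collate_bytes helper.
Python
from torch.utils.data import DataLoader

# Hypothetical collate function: read each object's bytes along with its key.
def collate_bytes(items):
    return [(item.key, item.read()) for item in items]

# Iterable-style datasets are consumed in listing order; shuffle=True is not allowed.
loader = DataLoader(iterable_dataset, batch_size=4, collate_fn=collate_bytes)
for batch in loader:
    for key, data in batch:
        print(key, len(data))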
Example using from_objects:
Python
from bostorchconnector import BosIterableDataset, BosClientConfig

# Fill in <BUCKET>, the object names and the corresponding endpoint
DATASET_URIS = [
    "bos://<BUCKET>/img001.jpg",
    "bos://<BUCKET>/img002.jpg",
    "bos://<BUCKET>/img003.jpg"
]
ENDPOINT = "http://bj.bcebos.com"
config = BosClientConfig(log_level=1)
iterable_dataset = BosIterableDataset.from_objects(DATASET_URIS, endpoint=ENDPOINT, bos_client_config=config)
# BosIterableDataset is an iterable object
for item in iterable_dataset:
    data = item.read()
    print(len(data))
    print(item.key)
