Building a Dataset
Updated at: 2025-11-03
bostorchconnector currently supports two types of datasets:
- BosMapDataset: Inherits from torch.utils.data.Dataset
- BosIterableDataset: Inherits from torch.utils.data.IterableDataset
Both datasets support the following two construction methods:
- from_prefix: Constructed through a storage prefix
- from_objects: Constructed through a known list of objects
The bos_client_config parameter accepts the following configuration options when building a dataset:
| Parameter | Default value | Description |
|---|---|---|
| credentials_path | "~/.baidubce/credentials" | Path to the file storing the access credentials (AK/SK) |
| log_level | 1 | Log level: 0 = DEBUG, 1 = INFO (default), 2 = WARN, 3 = ERROR, 4 = FATAL |
| log_path | "/tmp/bostorchconnector/sdk.log" | Path of the log file |
| part_size | 8388608 | Part size for multipart upload, in bytes; the default is 8,388,608 bytes (8 MB) |
| prefect_limit_mb | 4096 | Prefetch cache limit, in MB; the default is 4,096 MB |
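For example, these options can be passed to BosClientConfig as keyword arguments. This is a minimal sketch: apart from log_level, which the examples below use, the keyword names are assumptions taken from the table and should be verified against the library's actual signature.
Python
from bostorchconnector import BosClientConfig

# Sketch only: keyword names mirror the table above; except for log_level they are
# assumptions and should be checked against BosClientConfig's real signature.
config = BosClientConfig(
    credentials_path="~/.baidubce/credentials",  # file holding the AK/SK
    log_level=2,                                 # 2 = WARN
    log_path="/tmp/bostorchconnector/sdk.log",   # where SDK logs are written
    part_size=8 * 1024 * 1024,                   # 8 MB multipart part size
)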
Constructing a BosMapDataset for random access
The BosMapDataset is ideal for small datasets that require random access.
Example using from_prefix:
Python
from bostorchconnector import BosMapDataset, BosClientConfig

# Fill in <BUCKET>, <PREFIX> and the corresponding endpoint
DATASET_URI = "bos://<BUCKET>/<PREFIX>"
ENDPOINT = "http://bj.bcebos.com"
config = BosClientConfig(log_level=1)
map_dataset = BosMapDataset.from_prefix(DATASET_URI, endpoint=ENDPOINT, bos_client_config=config)
# Randomly read the sample with index=0
item = map_dataset[0]
# Obtain information such as bucket, key, and data
bucket = item.bucket
key = item.key
content = item.read()
print(len(content))
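Because BosMapDataset is a standard map-style dataset, it can be wrapped in torch.utils.data.DataLoader. The sketch below reuses the map_dataset built above; collate_bytes is a hypothetical collate function that reads each sampled object's bytes.
Python
from torch.utils.data import DataLoader

# Hypothetical collate function: read each sampled object's bytes along with its key.
def collate_bytes(items):
    return [(item.key, item.read()) for item in items]

# Map-style datasets support shuffling and random sampling through the DataLoader.
loader = DataLoader(map_dataset, batch_size=4, shuffle=True, collate_fn=collate_bytes)
for batch in loader:
    for key, data in batch:
        print(key, len(data))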
Example using from_objects:
Python
from bostorchconnector import BosMapDataset, BosClientConfig

# Fill in <BUCKET>, the object names and the corresponding endpoint
DATASET_URIS = [
    "bos://<BUCKET>/img001.jpg",
    "bos://<BUCKET>/img002.jpg",
    "bos://<BUCKET>/img003.jpg"
]
ENDPOINT = "http://bj.bcebos.com"
config = BosClientConfig(log_level=1)
map_dataset = BosMapDataset.from_objects(DATASET_URIS, endpoint=ENDPOINT, bos_client_config=config)
# Randomly read the sample with index=0
item = map_dataset[0]
# Obtain information such as bucket, key, and data
bucket = item.bucket
key = item.key
content = item.read()
print(len(content))
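Since the objects in this example are JPEG images, each item's bytes can be decoded with an image library; a minimal sketch, assuming Pillow is installed:
Python
from io import BytesIO
from PIL import Image

# Decode the first object's bytes into a PIL image (assumes the objects are JPEGs
# and that Pillow is installed); adapt the decoding step to your data format.
img = Image.open(BytesIO(map_dataset[0].read()))
print(img.size)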
Constructing a BosIterableDataset for streaming sequential reads
The BosIterableDataset is suitable for handling large datasets that need sequential processing.
Example using from_prefix:
Python
from bostorchconnector import BosIterableDataset, BosClientConfig

# Fill in <BUCKET>, <PREFIX> and the corresponding endpoint
DATASET_URI = "bos://<BUCKET>/<PREFIX>"
ENDPOINT = "http://bj.bcebos.com"
config = BosClientConfig(log_level=1)
iterable_dataset = BosIterableDataset.from_prefix(DATASET_URI, endpoint=ENDPOINT, bos_client_config=config)
# BosIterableDataset is an iterable object
for item in iterable_dataset:
    data = item.read()
    print(len(data))
    print(item.key)
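A BosIterableDataset can likewise be fed to torch.utils.data.DataLoader. As an iterable-style dataset it does not support shuffle=True, and whether samples are sharded across multiple workers depends on the library, so this sketch keeps the default single-process loading. It reuses the iterable_dataset built above and a hypothetical collate_bytes helper.
Python
from torch.utils.data import DataLoader

# Hypothetical collate function: read each object's bytes along with its key.
def collate_bytes(items):
    return [(item.key, item.read()) for item in items]

# Iterable-style datasets are consumed in listing order; shuffle=True is not allowed.
loader = DataLoader(iterable_dataset, batch_size=4, collate_fn=collate_bytes)
for batch in loader:
    for key, data in batch:
        print(key, len(data))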
Example using from_objects:
Python
from bostorchconnector import BosIterableDataset, BosClientConfig

# Fill in <BUCKET>, the object names and the corresponding endpoint
DATASET_URIS = [
    "bos://<BUCKET>/img001.jpg",
    "bos://<BUCKET>/img002.jpg",
    "bos://<BUCKET>/img003.jpg"
]
ENDPOINT = "http://bj.bcebos.com"
config = BosClientConfig(log_level=1)
iterable_dataset = BosIterableDataset.from_objects(DATASET_URIS, endpoint=ENDPOINT, bos_client_config=config)
# BosIterableDataset is an iterable object
for item in iterable_dataset:
    data = item.read()
    print(len(data))
    print(item.key)
