Application scenarios
Scenario 1: Big data computing scenario
In traditional big data environments, storage and computation are often integrated using an HDFS-based architecture. However, migrating this architecture to the cloud may introduce various challenges.
- It is difficult to balance cost and operational complexity:
  - Building HDFS on cloud disks results in an effective replication factor of 3 × 3 = 9 (three HDFS replicas, each stored on triple-replicated cloud disks), incurring relatively high costs;
  - Saving costs requires building on local disks instead, which leaves operational complexity no lower than in an on-premises IDC;
- Storage and computing resources are coupled and cannot be scaled separately:
  - This easily results in low utilization of one resource type or the other;
  - Elasticity may be insufficient when more computing power is needed.
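The replication cost mentioned above can be made concrete with a quick calculation (illustrative numbers; triple replication is typical for cloud block storage but varies by disk type):

```python
# Effective physical copies when layering HDFS replication on top of
# cloud disks that the provider already replicates internally.
hdfs_replicas = 3            # HDFS default replication factor
cloud_disk_replicas = 3      # assumed cloud block-storage replication
effective_copies = hdfs_replicas * cloud_disk_replicas
print(effective_copies)      # 9 physical copies per logical byte
```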
When using big data services in the cloud, relying solely on Baidu AI Cloud Object Storage (BOS) can address the issues of high costs and low scalability. However, simply simulating file system operations based on object storage still presents the following problems:
- Simulating a hierarchical directory tree on a flat namespace introduces redundant operations, resulting in poor metadata performance;
- Data-plane access latency is an order of magnitude higher than that of HDFS, and object storage bandwidth throttling further constrains performance;
- Hadoop compatibility is only partial, requiring special handling in certain scenarios; for example, rename is not atomic.
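The rename problem above can be sketched in a few lines. This is a hypothetical in-memory key store, not an actual BOS API: with no real directories, "renaming" a prefix must list, copy, and delete every object under it, so it is O(number of objects) and not atomic — a failure mid-loop leaves the namespace half-renamed.

```python
# Sketch: emulating directory rename on a flat object namespace.
def rename_prefix(store: dict, src: str, dst: str) -> None:
    """Emulate rename(src, dst) over a flat key/value namespace."""
    to_move = [k for k in store if k.startswith(src)]  # a full LIST
    for key in to_move:
        store[dst + key[len(src):]] = store[key]       # COPY each object
    for key in to_move:
        del store[key]                                 # DELETE each original

store = {"data/part-0": b"a", "data/part-1": b"b", "logs/x": b"c"}
rename_prefix(store, "data/", "archive/")
print(sorted(store))  # ['archive/part-0', 'archive/part-1', 'logs/x']
```

A real file system performs the same rename as a single metadata operation, which is why a POSIX-style metadata layer such as RapidFS avoids this cost.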
To overcome these issues, adopting a compute-storage decoupled architecture based on BOS and RapidFS accelerates big data compute nodes' access to storage resources while keeping costs low through an inexpensive storage tier.
Scenario 2: AI training scenario
AI scenarios pose the following storage requirements:
- POSIX compatibility is a hard requirement: scientists and algorithm engineers are most familiar with POSIX interfaces, which also offer the best support for mainstream frameworks and tooling;
- Training performance must be guaranteed: maximizing GPU utilization is one of the most critical concerns, with List and read I/O performance mattering most;
- Seamless integration with schedulers: data synchronization should ideally be automated, shielding users from underlying storage details and reducing usage costs.
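The small-file access pattern behind the performance requirement can be illustrated with a hypothetical micro-benchmark (local file system only, not tied to any specific backend): a training epoch over many small samples is dominated by per-file overhead (list/open/read) rather than bandwidth, and over an object-store emulation each of those operations becomes a remote request.

```python
import os
import tempfile
import time

def read_dataset(path: str) -> int:
    """Read every sample file in a directory, as a dataloader would."""
    total = 0
    for name in sorted(os.listdir(path)):      # the "List" phase
        with open(os.path.join(path, name), "rb") as f:
            total += len(f.read())             # one small read per sample
    return total

with tempfile.TemporaryDirectory() as d:
    for i in range(1000):                      # 1000 samples of 4 KiB each
        with open(os.path.join(d, f"sample_{i:04d}.bin"), "wb") as f:
            f.write(b"\0" * 4096)
    start = time.perf_counter()
    nbytes = read_dataset(d)
    elapsed = time.perf_counter() - start
    print(f"read {nbytes} bytes in {elapsed:.3f}s")
```

On a local disk this loop finishes almost instantly; if every `listdir` and `open` were instead an object-storage request with millisecond-scale latency, the same 1000-file epoch would take orders of magnitude longer.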
The above requirements are not well satisfied by traditional object storage solutions:
- Flat-namespace directory emulation exhibits poor POSIX compatibility and performance — the same challenges faced in the big data storage-compute separation scenario;
- Object storage can meet the high-throughput demands of AI training, but massive numbers of small files are common in AI training, and read latency directly against object storage is far from ideal;
- The FUSE and CSI plugins for object storage support only lightweight usage and are not integrated with schedulers, leaving many data synchronization details for customers to handle.
In AI training and HPC scenarios, you can adopt a compute-storage separation architecture based on Baidu AI Cloud Object Storage (BOS) and RapidFS. This approach improves storage access efficiency, speeds up AI model training, and reduces overall storage costs.
