Overview
What is the BOS HDFS tool
The BOS HDFS tool, developed by Baidu AI Cloud, is a utility built on the Hadoop framework. It is designed to make data stored in BOS easy to read, write, and use in big data scenarios.
Data analysis at big data scale has become a core focus for enterprises. Hadoop, with its strong distributed data processing capabilities, is one of the most widely used open-source big data frameworks, known for its reliability, efficiency, scalability, and concurrent processing. Its Hadoop Distributed File System (HDFS) offers high fault tolerance and high-throughput data access, which makes it well suited to workloads with very large datasets, and it is the storage cornerstone of the Hadoop ecosystem.
However, as data volumes grow, native Hadoop faces new challenges: HDFS infrastructure is costly to build and maintain, and storing vast amounts of data locally becomes difficult. As more enterprises migrate their data to the cloud, object storage services have emerged as a solution. Yet limitations in object storage APIs create bottlenecks when big data workloads access object storage or move data between object storage and self-hosted HDFS. BOS HDFS is designed to resolve these challenges.
BOS HDFS is fully compatible with Hadoop 2.7+ and 3.1+ and supports large-scale storage of HDFS data in BOS. Upper-layer applications access, read, and write data through the standard HDFS interfaces, which avoids the high operational costs and limited scalability of self-built HDFS. With this tool, users can take full advantage of BOS’s ultra-low cost, high performance, reliability, and high throughput to meet enterprise requirements for reading, writing, and using data in big data scenarios.
Advantages of the BOS HDFS tool
- Framework compatibility: Fully compatible with Hadoop 2.7+/3.1+
- Seamless access: Data in BOS can be called transparently through standard HDFS interfaces
- Cost-effective data storage: Combines the ultra-low cost, ultra-high performance, high reliability, high availability, and high throughput of Baidu AI Cloud Object Storage (BOS)
Update records
【1.0.6】
- Requests now use the virtual-host domain name style by default (the original path-style domain name may trigger a 403 error)
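To make the difference between the two domain name styles concrete, here is a minimal sketch. It is not the tool's own code: the helper `to_virtual_host_style`, the bucket `my-bucket`, and the object key are hypothetical, and `bj.bcebos.com` is used only as an example endpoint. It assumes a path-style URL has the form `scheme://endpoint/bucket/key` and a virtual-host style URL has the form `scheme://bucket.endpoint/key`.

```python
from urllib.parse import urlsplit, urlunsplit


def to_virtual_host_style(path_style_url: str) -> str:
    """Move the bucket from the first path segment into the host name.

    Hypothetical helper for illustration; BOS HDFS performs its own
    conversion internally.
    """
    parts = urlsplit(path_style_url)
    # Path-style layout: /<bucket>/<object key>
    bucket, _, key = parts.path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError("path-style URL must contain a bucket segment")
    host = f"{bucket}.{parts.netloc}"
    return urlunsplit((parts.scheme, host, "/" + key, parts.query, parts.fragment))


# Example values are assumptions, not taken from the BOS HDFS documentation.
path_style = "https://bj.bcebos.com/my-bucket/logs/2024/part-0000"
print(to_virtual_host_style(path_style))
# prints https://my-bucket.bj.bcebos.com/logs/2024/part-0000
```

With version 1.0.6 and later this conversion is the default behavior, so no manual rewriting of URLs should be needed.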
【1.0.5】
- Support the append/truncate interfaces
- Support isolated per-bucket configuration of AK/SK/endpoint
- Support EnvironmentVariableCredentialsProvider
- Optimize the getFileStatus/create/delete API
- Optimize the isFile/isDirectory API
- Update the hadoop-common dependency to version 3.2.2
【1.0.4】
- Support CRC32C checksum verification
- Optimize the create API
- Optimize the open API
- Optimize the hierarchical rename API
- Adjust the default part size (from 10 MB to 12 MB)
- Optimize sequential read policies
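As a rough illustration of what the larger default part size means, the sketch below counts how many parts a file would be split into at each setting. It assumes the part size refers to fixed-size upload parts and that 1 MB means 2**20 bytes; neither detail is confirmed by the release note, and `part_count` is a hypothetical helper, not part of BOS HDFS.

```python
MB = 1024 * 1024  # assumption: "MB" in the release note means 2**20 bytes


def part_count(file_size: int, part_size: int) -> int:
    """Number of fixed-size parts needed to cover file_size bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # Ceiling division using integers only.
    return (file_size + part_size - 1) // part_size


one_gib = 1024 * MB
print(part_count(one_gib, 10 * MB))  # prints 103
print(part_count(one_gib, 12 * MB))  # prints 86
```

Under these assumptions, a 1 GiB file needs 86 parts instead of 103, so fewer requests are issued per file.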
【1.0.3】
- Support hierarchical buckets
- Optimize the read cache, which is now enabled by default
- Optimize sequential read
- Optimize multi-file deletion
- Fix known issues
