BOS Alluxio Extension
What is Alluxio
Alluxio is an open-source, memory-based distributed storage system (a memory-speed virtual distributed storage system).
In the big data ecosystem, Alluxio sits between data-driven frameworks or applications (such as Apache Spark) and persistent storage systems (such as BOS, HDFS, and S3). Alluxio unifies the data stored in these different storage systems and provides a unified client API and a global namespace to the data-driven applications above it.
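The global namespace can be pictured as a prefix mapping: a path under an Alluxio mount point corresponds to the same relative path under the mounted storage URI. The following is a minimal sketch of that idea only, not Alluxio's own implementation; the function name `resolve_ufs_path`, the bucket `example-bucket`, and the HDFS address are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: translate an Alluxio path into the under-storage URI
# it maps to, given one mount point and the storage URI mounted there.
resolve_ufs_path() {
  local alluxio_path="$1" mount_point="$2" ufs_uri="$3"
  case "$alluxio_path" in
    # The mount point itself maps to the root of the mounted storage.
    "$mount_point")   echo "$ufs_uri" ;;
    # Anything below the mount point keeps its relative path.
    "$mount_point"/*) echo "${ufs_uri}${alluxio_path#"$mount_point"}" ;;
    *) echo "not under $mount_point: $alluxio_path" >&2; return 1 ;;
  esac
}

# Two different storage systems addressed through one namespace (example values).
resolve_ufs_path /mnt/bos/LICENSE /mnt/bos bos://example-bucket
resolve_ufs_path /mnt/hdfs/logs/app.log /mnt/hdfs hdfs://namenode:9000/data
```

An application only ever sees the left-hand Alluxio paths, which is why the same code can read from BOS or HDFS without changes.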
Advantages of Using BOS through Alluxio
Alluxio's memory-first tiered architecture makes BOS data more convenient to access. For frequently accessed data, it greatly reduces the number of requests sent to the BOS API.
Read
- Memory-speed I/O throughput: Alluxio's tiered storage keeps frequently accessed data in the cache and serves it from there
- Lower latency for some BOS operations, such as listing directories and renaming
- Simplified data management: Alluxio provides a single access point for data from multiple sources
- Compatibility: existing data analysis applications, such as Spark and MapReduce, can run on Alluxio without any code changes
Write
Smart caching: the write policy can be configured as needed:
- MUST_CACHE: Writes only to the Alluxio cache. Suitable for temporary data that does not need to be persisted. The risk of data loss is high, but performance is the best
- THROUGH: Writes directly to BOS. Suitable for data that will not be used in the near future, leaving more memory space in Alluxio for other frequently read data
- CACHE_THROUGH: Writes to Alluxio and BOS at the same time. Suitable for data that will soon be used by other Alluxio applications
- ASYNC_THROUGH: The default mode. Writes to the cache and persists to BOS asynchronously. Suitable for data that needs to be persisted and does not need to be used immediately
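The choice between the four write types can be summarized as a small lookup. This is a minimal sketch: the function name `pick_write_type` and the scenario labels (`temporary`, `cold`, `shared`, `durable`) are hypothetical and only restate the list above. The property name is the one used later in the configuration step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a data usage scenario to an Alluxio write type.
# The scenario labels are made up for this sketch; they are not Alluxio terms.
pick_write_type() {
  case "$1" in
    temporary) echo "MUST_CACHE" ;;     # cache only, never persisted
    cold)      echo "THROUGH" ;;        # written straight to BOS
    shared)    echo "CACHE_THROUGH" ;;  # cache and BOS at the same time
    durable)   echo "ASYNC_THROUGH" ;;  # cache now, BOS asynchronously
    *)         echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

# Print the matching line for conf/alluxio-site.properties.
echo "alluxio.user.file.writetype.default=$(pick_write_type shared)"
```

The printed line can be pasted into the configuration file; unknown labels fail with a non-zero exit status instead of silently falling back to a default.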
Quick Start
Deploy
- Download Alluxio and the pre-compiled BOS under storage extension [alluxio-underfs-bos](https://sdk.bce.baidu.com/console-sdk/alluxio-underfs-bos-0.1.1.jar.zip), then unzip them.
- Install the jar package of alluxio-underfs-bos.
$ cd ${ALLUXIO_HOME}
$ ./bin/alluxio extensions install <path>/alluxio-underfs-bos-0.1.0.jar
- In the ${ALLUXIO_HOME}/conf directory, create the alluxio-site.properties configuration file from the template file and configure the BOS AK/SK. Temporary authorization through STS is also supported.
$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
fs.bos.accessKey=<BOS_AK>
fs.bos.secretKey=<BOS_SK>
fs.bos.endpoint=<BOS_ENDPOINT>
alluxio.user.file.writetype.default=CACHE_THROUGH
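The four properties above can also be written by a script instead of by hand. The following is a minimal sketch under stated assumptions: the function name `write_bos_conf` and the example values are hypothetical, and the demo writes to a temporary file rather than to a real conf/alluxio-site.properties.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the BOS settings to an Alluxio properties file.
# Arguments: target file, access key, secret key, endpoint, write type (optional).
write_bos_conf() {
  local conf_file="$1" ak="$2" sk="$3" endpoint="$4" write_type="${5:-CACHE_THROUGH}"
  # Refuse to write an incomplete configuration.
  if [ -z "$ak" ] || [ -z "$sk" ] || [ -z "$endpoint" ]; then
    echo "access key, secret key and endpoint are all required" >&2
    return 1
  fi
  {
    printf 'fs.bos.accessKey=%s\n' "$ak"
    printf 'fs.bos.secretKey=%s\n' "$sk"
    printf 'fs.bos.endpoint=%s\n' "$endpoint"
    printf 'alluxio.user.file.writetype.default=%s\n' "$write_type"
  } >> "$conf_file"
}

# Demo against a temporary file so that no real configuration is touched.
demo_conf="$(mktemp)"
write_bos_conf "$demo_conf" "example-ak" "example-sk" "example-endpoint"
cat "$demo_conf"
rm -f "$demo_conf"
```

In a real deployment the first argument would be conf/alluxio-site.properties, and the credentials would come from a secure source rather than from literals on the command line.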
Mount BOS in Alluxio
- Format the Alluxio journal and worker storage directories.
$ ./bin/alluxio format
- Start Alluxio on localhost.
$ ./bin/alluxio-start.sh local
- Create a directory and mount BOS. A bucket must already exist in BOS. The bucket "test-979" is used as an example here.
$ ./bin/alluxio fs mkdir /mnt
Successfully created directory /mnt
$ ./bin/alluxio fs mount /mnt/bos bos://test-979
Mounted bos://test-979 at /mnt/bos
- Copy local files to Alluxio.
$ ./bin/alluxio fs copyFromLocal LICENSE /mnt/bos
Copied file:///alluxio-2.0.0/LICENSE to /mnt/bos
- Use the ls command to view the file just copied. The columns are, in order: permissions, file size, whether the file is persisted to BOS, creation date, percentage of the file cached in Alluxio, and file name.
$ ./bin/alluxio fs ls /mnt/bos/LICENSE
-rwxrwxrwx 27040 PERSISTED 07-21-2020 15:06:46:000 100% /mnt/bos/LICENSE
- Stop Alluxio
$ ./bin/alluxio-stop.sh local
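The column order of the ls output shown above can be made explicit with a small filter. This is a minimal sketch that assumes exactly the seven whitespace-separated columns printed in this document; the function name `describe_ls_line` and the field labels are hypothetical. Other Alluxio versions may print additional columns, in which case the field positions would need adjusting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label the columns of `alluxio fs ls` output lines,
# assuming the seven-column layout shown in this document.
# Paths that contain spaces are not handled by this sketch.
describe_ls_line() {
  awk '{
    printf "permissions=%s size_bytes=%s state=%s created=%s %s cached=%s path=%s\n",
           $1, $2, $3, $4, $5, $6, $7
  }'
}

# Demo on the sample line from the step above.
echo '-rwxrwxrwx 27040 PERSISTED 07-21-2020 15:06:46:000 100% /mnt/bos/LICENSE' | describe_ls_line
```

With a running Alluxio instance, the same filter could be fed directly, for example `./bin/alluxio fs ls /mnt/bos | describe_ls_line`.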
Accelerate Access to BOS Data with Alluxio
Alluxio caches data in memory, which makes access to BOS data faster. The following steps demonstrate this:
- List the files in the /mnt/bos directory. 0% means the file is not in Alluxio memory.
$ time ./bin/alluxio fs ls /mnt/bos
-rwxrwxrwx 27040 PERSISTED 07-21-2020 15:06:46:000 0% /mnt/bos/LICENSE
-rwxrwxrwx 51307896 PERSISTED 07-21-2020 15:05:49:000 0% /mnt/bos/alluxio-underfs-bos-0.1.0.jar
real 0m2.297s
user 0m2.703s
sys 0m0.269s
- Count the lines that contain "the" in the file /mnt/bos/LICENSE.
$ time ./bin/alluxio fs cat /mnt/bos/LICENSE | grep -c the
200
real 0m3.357s
user 0m2.974s
sys 0m0.289s
- BOS data that has been read is kept in memory and can be accessed faster when read again. 100% means that the file has been fully loaded into Alluxio memory.
$ ./bin/alluxio fs ls /mnt/bos/LICENSE
-rwxrwxrwx 27040 PERSISTED 07-21-2020 15:06:46:000 100% /mnt/bos/LICENSE
- Count the lines that contain "the" in the file /mnt/bos/LICENSE again. The elapsed time drops from 3.357s to 2.189s, because the data no longer needs to be pulled from BOS on the second read.
$ time ./bin/alluxio fs cat /mnt/bos/LICENSE | grep -c the
200
real 0m2.189s
user 0m2.835s
sys 0m0.286s
- List the files in the /mnt/bos directory again. This is also faster: 2.297s → 1.793s.
$ time ./bin/alluxio fs ls /mnt/bos
-rwxrwxrwx 27040 PERSISTED 07-21-2020 15:06:46:000 100% /mnt/bos/LICENSE
-rwxrwxrwx 51307896 PERSISTED 07-21-2020 15:05:49:000 0% /mnt/bos/alluxio-underfs-bos-0.1.0.jar
real 0m1.793s
user 0m2.630s
sys 0m0.238s
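The improvement between the cold and warm runs above can be expressed as a percentage reduction in the `real` time. This is a minimal sketch; the function name `time_reduction_pct` is hypothetical, and the inputs are the `real` values reported by `time` in the steps above, converted to seconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage reduction in wall-clock time between
# a cold run (data fetched from BOS) and a warm run (data served from cache).
# Arguments: cold time in seconds, warm time in seconds.
time_reduction_pct() {
  awk -v cold="$1" -v warm="$2" 'BEGIN {
    if (cold <= 0) exit 1              # avoid division by zero
    printf "%.1f\n", (cold - warm) / cold * 100
  }'
}

time_reduction_pct 3.357 2.189   # cat /mnt/bos/LICENSE, first vs second read
time_reduction_pct 2.297 1.793   # ls /mnt/bos, first vs second listing
```

For the measurements in this section, this works out to a reduction of roughly 35% for reading the file and roughly 22% for listing the directory. These are single runs on a small file, so the figures are indicative only.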