          BOS Alluxio Extension

          What is Alluxio

          Alluxio is an open-source, memory-based distributed storage system (a memory-speed virtual distributed storage system).

          In the big data ecosystem, Alluxio sits between data-driven frameworks or applications (such as Apache Spark) and various persistent storage systems (such as BOS, HDFS, and S3). Alluxio unifies the data stored in these different storage systems and provides a unified client API and a global namespace to the data-driven applications above it.

          Advantages of Using BOS through Alluxio

          Alluxio's memory-first tiered architecture makes BOS data easier to access. For frequently accessed data, it greatly reduces the number of requests sent to the BOS interface.

          Read

          • Memory-level I/O throughput: Alluxio's tiered storage mechanism makes full use of cached, frequently accessed data
          • Lower latency for some BOS operations (such as listing directories and renaming)
          • Simpler data management: Alluxio supports single-point access to multi-source data
          • Compatibility: existing data analysis applications, such as Spark and MapReduce, can run on Alluxio without any code changes

          Write

          Smart caching: the write policy can be configured as needed.

          • MUST_CACHE: Writes to the cache only. Suitable for temporary data that does not need to be persisted. Performance is the best, but the risk of data loss is high
          • THROUGH: Writes directly to BOS. Suitable for data that will not be used in the near future, leaving more Alluxio memory for other frequently read data
          • CACHE_THROUGH: Writes to Alluxio and BOS synchronously. Suitable for data that will soon be used by other Alluxio applications
          • ASYNC_THROUGH: The default mode. Writes to the cache first, then writes to BOS asynchronously. Suitable for data that needs to be persisted and does not need to be used immediately
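The write type is selected with the `alluxio.user.file.writetype.default` property, which also appears in the deployment configuration in this document. As a minimal sketch, the helper below sets that property in a properties file and rejects any value other than the four write types listed above. The function name `set_write_type` is hypothetical and is not part of Alluxio.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Alluxio): set the default write type
# in an alluxio-site.properties file.
# Usage: set_write_type conf/alluxio-site.properties ASYNC_THROUGH
set_write_type() {
    conf_file=$1
    write_type=$2
    key=alluxio.user.file.writetype.default

    # Accept only the four write types described above.
    case "$write_type" in
        MUST_CACHE|THROUGH|CACHE_THROUGH|ASYNC_THROUGH) ;;
        *) echo "unknown write type: $write_type" >&2; return 1 ;;
    esac

    tmp_file=$(mktemp) || return 1
    # Copy every line except an existing setting for this key.
    if [ -f "$conf_file" ]; then
        awk -v k="$key" 'index($0, k "=") != 1' "$conf_file" > "$tmp_file"
    fi
    # Append the new setting, then write the result back in place.
    echo "${key}=${write_type}" >> "$tmp_file"
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}
```

Rewriting the file in place keeps every other setting untouched and leaves exactly one line for the write type, so repeated calls do not accumulate duplicates.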

          Quick Start

          Deploy

          1. Download Alluxio and the pre-compiled BOS underlying storage extension [alluxio-underfs-bos](https://sdk.bce.baidu.com/console-sdk/alluxio-underfs-bos-0.1.1.jar.zip), then unzip them.
          2. Install the alluxio-underfs-bos jar package. Replace <path> and the file name with those of the jar you downloaded.
          $ cd ${ALLUXIO_HOME}
          $ ./bin/alluxio extensions install <path>/alluxio-underfs-bos-0.1.0.jar
          3. In the ${ALLUXIO_HOME}/conf directory, create the alluxio-site.properties configuration file from the template, then add the BOS AK/SK and endpoint to it. Temporary authorization through STS is also supported.
          $ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
          # Add the following settings to conf/alluxio-site.properties:
          fs.bos.accessKey=<BOS_AK> 
          fs.bos.secretKey=<BOS_SK> 
          fs.bos.endpoint=<BOS_ENDPOINT> 
          alluxio.user.file.writetype.default=CACHE_THROUGH
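Before starting Alluxio, it can help to confirm that the three BOS settings above were actually filled in. The following is a minimal sketch of such a check. The function name `check_bos_conf` is hypothetical, and treating a value of the form `<...>` as an unfilled placeholder is an assumption based on the template values shown above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Alluxio): report BOS settings that are
# missing, empty, or still set to a <PLACEHOLDER> value.
# Usage: check_bos_conf conf/alluxio-site.properties
check_bos_conf() {
    conf_file=$1
    missing=0
    for key in fs.bos.accessKey fs.bos.secretKey fs.bos.endpoint; do
        # Take the text after the first "=" on the first line for this key,
        # with surrounding whitespace trimmed.
        value=$(awk -F= -v k="$key" '
            $1 == k {
                sub(/^[^=]*=/, "")
                gsub(/^[ \t]+|[ \t]+$/, "")
                print
                exit
            }' "$conf_file")
        case "$value" in
            ""|"<"*">")
                echo "missing or placeholder: $key"
                missing=1
                ;;
        esac
    done
    return "$missing"
}
```

The function returns a non-zero status when any key is unset, so it can gate a start script, for example `check_bos_conf conf/alluxio-site.properties && ./bin/alluxio-start.sh local`.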

          Mount BOS in Alluxio

          1. Format the Alluxio journal and worker storage directories.
          $ ./bin/alluxio format 
          2. Start Alluxio on localhost.
          $ ./bin/alluxio-start.sh local 
          3. Create a directory and mount BOS. A bucket must already exist in BOS. This example uses the bucket "test-979".
          $ ./bin/alluxio fs mkdir /mnt 
          Successfully created directory /mnt 
          $ ./bin/alluxio fs mount /mnt/bos bos://test-979 
          Mounted bos://test-979 at /mnt/bos 
          4. Copy a local file to Alluxio.
          $ ./bin/alluxio fs copyFromLocal LICENSE /mnt/bos 
          Copied file:///alluxio-2.0.0/LICENSE to /mnt/bos 
          5. Use the ls command to view the file just copied. The columns displayed are: permissions, file size, whether the file is persisted to BOS, creation date, percentage of the file cached in Alluxio, and file name.
          $ ./bin/alluxio fs ls /mnt/bos/LICENSE 
          -rwxrwxrwx        27040        PERSISTED 07-21-2020 15:06:46:000 100% /mnt/bos/LICENSE 
          6. Stop Alluxio.
          $ ./bin/alluxio-stop.sh local 
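To make the column meanings of the ls output concrete, the sketch below labels each field of one output line. It assumes the whitespace-separated layout of the sample output in this document, which may differ in other Alluxio versions, and it does not handle paths that contain spaces. The function name `describe_ls_line` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: label the columns of one line of "alluxio fs ls" output.
# Assumes the column layout of the sample output in this document.
describe_ls_line() {
    printf '%s\n' "$1" | awk '{
        print "permissions=" $1
        print "size_bytes=" $2
        print "persistence=" $3
        print "date=" $4 " " $5
        print "cached=" $6
        print "path=" $7
    }'
}

# Sample line copied from the ls output above.
describe_ls_line '-rwxrwxrwx        27040        PERSISTED 07-21-2020 15:06:46:000 100% /mnt/bos/LICENSE'
```

The `persistence` and `cached` fields are the two to watch: the first shows whether the file has reached BOS, the second how much of it currently sits in Alluxio storage.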

          Try Alluxio to Accelerate Access to BOS Data

          Alluxio stores data in memory, so BOS data can be accessed faster. If Alluxio was stopped in the previous step, start it again first, then try the following:

          1. List the files in the /mnt/bos directory. 0% means the file is not in Alluxio memory.
          $ time ./bin/alluxio fs ls /mnt/bos 
          -rwxrwxrwx        27040        PERSISTED 07-21-2020 15:06:46:000    0% /mnt/bos/LICENSE 
          -rwxrwxrwx        51307896     PERSISTED 07-21-2020 15:05:49:000    0% /mnt/bos/alluxio-underfs-bos-0.1.0.jar 
          real 	 0m2.297s 
          user 	 0m2.703s 
          sys 	 0m0.269s 
          2. Count the lines that contain "the" in the file /mnt/bos/LICENSE (grep -c counts matching lines, not individual occurrences).
          $ time ./bin/alluxio fs cat /mnt/bos/LICENSE | grep -c the 
          200
          real 	 0m3.357s 
          user 	 0m2.974s 
          sys 	 0m0.289s 
          3. BOS data that has been read is kept in memory and can be accessed faster when read again. 100% means the file has been fully loaded into Alluxio memory.
          $ ./bin/alluxio fs ls /mnt/bos/LICENSE
          -rwxrwxrwx        27040        PERSISTED 07-21-2020 15:06:46:000 100% /mnt/bos/LICENSE 
          4. Count the lines that contain "the" in the file /mnt/bos/LICENSE again. The time drops from 3.357s to 2.189s, because the data no longer has to be pulled from BOS on the second read.
          $ time ./bin/alluxio fs cat /mnt/bos/LICENSE | grep -c the 
          200
          real 	 0m2.189s 
          user 	 0m2.835s 
          sys 	 0m0.286s 
          5. List the files in the /mnt/bos directory again. This is also faster: 2.297s → 1.793s.
          $ time ./bin/alluxio fs ls /mnt/bos 
          -rwxrwxrwx        27040        PERSISTED 07-21-2020 15:06:46:000 100% /mnt/bos/LICENSE 
          -rwxrwxrwx        51307896     PERSISTED 07-21-2020 15:05:49:000    0% /mnt/bos/alluxio-underfs-bos-0.1.0.jar 
          real 	 0m1.793s 
          user 	 0m2.630s 
          sys 	 0m0.238s 
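The timings above can be turned into percentages with a little arithmetic. The sketch below computes the relative reduction in wall-clock ("real") time for the two pairs of measurements in this section. Each figure comes from a single run and includes the start-up time of the command-line client, so the results are indicative only. The function name `percent_saved` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: percentage reduction from a "before" to an "after" time,
# both given in seconds.
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

percent_saved 3.357 2.189   # cat | grep: prints 34.8
percent_saved 2.297 1.793   # ls:         prints 21.9
```

In these runs, reading the cached file saved roughly a third of the elapsed time, and listing the directory saved roughly a fifth.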