          Timed Analysis on Log Data

          Overview

          A scheduled task starts a cluster and runs its steps at set times, which suits recurring, timed analysis of existing data. Its advantages are as follows:

          • Save time: Create the task once, then start or stop it with one click.
          • Save money: The cluster is started and stopped according to the task.
          • Save effort: The cluster is hosted, so no deployment or operations work is needed.

          This document walks through implementing a scheduled task. The task starts a cluster on a schedule and runs a MapReduce step that analyzes website logs to produce daily visitor traffic statistics.

          TimedTask_Overview_001_en.png

          Example Log

          The example log is an Nginx log stored in a publicly readable path of Baidu Object Storage (BOS):

          • The sample data stored in the “North China - Beijing” region is available only to BMR clusters in the North China region. The paths are as follows:

            • bos://datamart-bj/access-log/201701102000/access.log
            • bos://datamart-bj/access-log/201701112000/access.log
            • bos://datamart-bj/access-log/201701122000/access.log
            • bos://datamart-bj/access-log/201701132000/access.log
            • bos://datamart-bj/access-log/201701142000/access.log
          • The sample data stored in the “South China - Guangzhou” region is available only to BMR clusters in the South China region. The paths are as follows:

            • bos://datamart-gz/access-log/201701102000/access.log
            • bos://datamart-gz/access-log/201701112000/access.log
            • bos://datamart-gz/access-log/201701122000/access.log
            • bos://datamart-gz/access-log/201701132000/access.log
            • bos://datamart-gz/access-log/201701142000/access.log

          For a description of Baidu AI Cloud regions, see Introduction of Region Selection.
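
The sample paths listed above follow a regular pattern: a region-specific bucket, then one folder per day named after the date plus the fixed 20:00 timestamp. The sketch below rebuilds those paths in Python. The function name and the region keys `bj` and `gz` are hypothetical helpers introduced for illustration; only the bucket names and the path layout come from this document.

```python
from datetime import date, timedelta

# Bucket names taken from the sample paths listed in this document.
REGION_BUCKETS = {"bj": "datamart-bj", "gz": "datamart-gz"}


def sample_log_paths(region, first_day=date(2017, 1, 10), days=5):
    """Return the sample access-log paths for one region, one per day."""
    bucket = REGION_BUCKETS[region]
    paths = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        # Each folder is named <YYYYMMDD>2000, matching the listed paths.
        paths.append(f"bos://{bucket}/access-log/{day:%Y%m%d}2000/access.log")
    return paths
```

Calling `sample_log_paths("bj")` yields the five Beijing paths, and `sample_log_paths("gz")` the five Guangzhou paths.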

          Design Steps

          1. Write the step program. The MapReduce sample code used in this document is available at https://github.com/BCEBIGDATA/bmr-sample-java; you can clone the repository from GitHub and develop your program locally.
          2. Compile the program to generate a jar package. For more information, see Maven project compiling.
          3. Upload the compiled jar package to Baidu Object Storage (BOS). For details, see the Baidu Object Storage (BOS) Start Guide.
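
To make the idea of a traffic-counting step concrete, here is a minimal Python sketch of the map and reduce phases. It is not the logic of the sample program in the repository above, which this document does not describe. It assumes the logs use Nginx's default "combined" format, where each line carries a timestamp such as `[10/Jan/2017:20:00:01 +0800]`; the document does not state the actual log format. All function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: Nginx "combined" format. Captures only the day part of a
# timestamp such as "[10/Jan/2017:20:00:01 +0800]".
TIMESTAMP_DAY = re.compile(
    r"\[(\d{2}/[A-Za-z]{3}/\d{4}):\d{2}:\d{2}:\d{2} [+-]\d{4}\]"
)


def map_line(line):
    """Map phase: emit (day, 1) for each line with a parsable timestamp."""
    match = TIMESTAMP_DAY.search(line)
    if match:
        yield match.group(1), 1


def reduce_counts(pairs):
    """Reduce phase: sum the emitted counts per day."""
    totals = Counter()
    for day, count in pairs:
        totals[day] += count
    return dict(totals)


def count_requests_per_day(lines):
    """Run both phases over an iterable of log lines."""
    return reduce_counts(pair for line in lines for pair in map_line(line))
```

Lines without a recognizable timestamp are skipped rather than raising an error, which keeps a single malformed line from failing the whole run.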

          Store Logs

          1. Define the schedule: From January 10 to January 14, 2017, analyze the latest day's log data at 20:00 every day.
          2. Prepare the log data. You can use the Example Log provided by Baidu AI Cloud directly. Once you are familiar with scheduled tasks, refer to Data Preparation to choose your own log data.

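
The schedule above amounts to five runs, one per day at 20:00. The following Python sketch enumerates them. It assumes the end time is inclusive, which this document implies but does not state explicitly; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def scheduled_runs(start, end, every=timedelta(days=1)):
    """List the activation times from start to end (inclusive)."""
    runs = []
    current = start
    while current <= end:
        runs.append(current)
        current += every
    return runs


# The schedule used in this document: daily at 20:00, January 10 to 14, 2017.
runs = scheduled_runs(datetime(2017, 1, 10, 20), datetime(2017, 1, 14, 20))
```

With these arguments, `runs` holds five activation times, the first on January 10 and the last on January 14.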
          Activate Scheduled Tasks

          Create Cluster Templates

          1. Log in to the console, select "Product Service->Baidu MapReduce BMR", and click "Cluster Template" to open the template list page.
          2. Click "Create Cluster", and configure the following in the "Basic Cluster Settings" section:

            • Cluster template: Enter the template name “timedtask”.
            • Log: Select the storage path of the cluster log.
            • Advanced settings: Enable "Automatic Termination".
          3. In the "Cluster Configuration" section, select the image version BMR 1.0.0 (hadoop 2.7) and the template “hadoop”.
          4. Keep the other default settings, and click "Finish".
          5. Click the created cluster template to view its details, as follows:

          image.png

          Create Scheduled Tasks

          1. On the "Product Service->Baidu MapReduce BMR" page, click "Scheduled Task" to open the scheduled task list page.
          2. Click "Create Timed Task", enter a task name in the "Task Parameters" section, and select the cluster template “timedtask” created earlier.
          3. In the "Execution Frequency" section, set the execution frequency to “every 1 day”, the task start time to “20:00:00 of 2017-01-10”, and the task end time to “20:00:00 of 2017-01-14”.
          4. In the "Step Setup" section, click "Add Step", and configure the following:

            • Step type: Select “java step”.
            • Step name: Enter “timedtaskjob”.
            • Application location: If you use a program you compiled yourself, upload its jar package to BOS or your local HDFS and enter the program path here. Alternatively, use the sample program provided by Baidu AI Cloud directly; its paths are as follows:

              • Path of the sample program for BMR clusters in the North China - Beijing region: bos://bmr-public-bj/sample/mapreduce-1.0-SNAPSHOT.jar
              • Path of the sample program for BMR clusters in the South China - Guangzhou region: bos://bmr-public-gz/sample/mapreduce-1.0-SNAPSHOT.jar
            • Action after failure: Continue.
            • MainClass: Enter “com.baidu.cloud.bmr.mapreduce.AccessLogAnalyzer”.
            • Application parameters: Specify the input data path and the output path (BOS or HDFS). The output path must be writable and must not duplicate an existing path. For example, if a sample log is used as input data and BOS is used for the output path, enter the following:

              • Parameters for BMR clusters in the North China - Beijing region: bos://datamart-bj/access-log/201701102000/access.log bos://{your-bucket}/output/%Y%m%d%H%M
              • Parameters for BMR clusters in the South China - Guangzhou region: bos://datamart-gz/access-log/201701102000/access.log bos://{your-bucket}/output/%Y%m%d%H%M

              Notes:

              • Please replace {your-bucket} with your bucket name.
              • When the cluster is activated, the system automatically replaces the string “%Y%m%d%H%M” in the input/output addresses with the activation time. For example, when the cluster is activated at 20:00 on November 28, 2017, the data at bos://datamart-bj/access-log/201701102000/access.log is read, and the result is automatically written to the folder bos://{your-bucket}/output/201711282000/.
          5. Click "OK" to finish adding the step.
          6. Click "Finish" to complete the creation of the scheduled task.
          7. You can view the created task on the "Product Service>MapReduce>Baidu MapReduce-Timed Task" page, as the following figure shows:

          image.png
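
The “%Y%m%d%H%M” placeholder described in the notes to step 4 uses the same directives as the standard `strftime` function, so its expansion can be previewed locally. This is a minimal sketch of that behavior, not the system's actual implementation; the function name and the bucket name `my-bucket` are hypothetical.

```python
from datetime import datetime


def expand_path(template, activated_at):
    """Fill the time placeholders in a path with the activation time."""
    # Caveat: strftime interprets every "%" directive, so a literal "%"
    # elsewhere in the path would also be treated as a placeholder.
    return activated_at.strftime(template)


# "my-bucket" stands in for {your-bucket}; replace it with your bucket name.
output_path = expand_path(
    "bos://my-bucket/output/%Y%m%d%H%M",
    datetime(2017, 11, 28, 20, 0),
)
```

For an activation at 20:00 on November 28, 2017, `output_path` is `bos://my-bucket/output/201711282000`, matching the example in the notes. A path without placeholders, such as the fixed sample input path, is returned unchanged.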

          Read Analysis Results

          From January 10 to January 14, 2017, the system automatically activates the cluster and runs the step at 20:00 every day, and releases the cluster automatically after the step completes. You can view the output of each run under the path bos://{your-bucket}/output/. The following figure shows the analysis result of the first task:
