Pig

Last Updated: 2020-07-20

Pig Introduction

This document uses web log analysis, specifically statistics on daily page views (PV) and unique visitors (UV), as an example to show how to use Pig on the Baidu AI Cloud platform.

Pig is a large-scale data analysis platform built on Hadoop. It converts SQL-like data analysis requests into a set of optimized MapReduce operations. Pig is suited to massively parallel processing of large datasets. For parallel computation over complex, massive data, Pig provides a simple operation and programming interface that is easy to write and maintain and lets users build their own processing flows for different purposes.

Pig supports complex data types (bag, tuple, and map) and simple data types (such as int, long, float, double, chararray, and bytearray).
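As a rough illustration of these types, the sketch below declares a schema that mixes simple and complex types and then reads a value from each. The field names (user, visits, location, props) and the 'browser' map key are hypothetical; they are not part of the sample data used later in this document.

```pig
-- Hypothetical schema: none of these field names come from the sample data.
-- user is a simple type, visits is a bag of tuples,
-- location is a tuple, and props is a map.
records = LOAD '${INPUT}' AS (
    user:chararray,
    visits:bag{t:tuple(page:chararray, ms:long)},
    location:tuple(lat:double, lon:double),
    props:map[]
);

-- Print the schema Pig has associated with the relation.
DESCRIBE records;

-- Read a tuple field with '.', look up a map key with '#',
-- and unnest the bag into one row per visit with FLATTEN.
details = FOREACH records GENERATE user, location.lat AS lat, props#'browser' AS browser, FLATTEN(visits);
```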

Program Preparation

You can use the sample program directly:

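-- Load the raw access log with the default loader and no schema.
lines = LOAD '${INPUT}';
-- Split each log line into fields with a regular expression.
fields = FOREACH lines GENERATE FLATTEN(REGEX_EXTRACT_ALL($0, '(\\S+)\\s+-\\s+\\[(.*?)\\]\\s+\\"(.*)\\"\\s+(\\d{3})\\s+(.*)\\s+\\"(.*)\\"\\s+\\"(.*)\\"\\s+(.*)\\s+\\"(.*)\\"\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)'));
-- Keep the client IP and reduce the timestamp to a day (dd/MMM/yyyy).
access_logs = FOREACH fields GENERATE $0 AS remote_ip, ToString(ToDate($1, 'dd/MMM/yyyy:HH:mm:ss Z'), 'dd/MMM/yyyy') AS date;
-- Group by day and count the records in each group to get the daily PV.
groups = GROUP access_logs BY date;
pv = FOREACH groups GENERATE group, COUNT(access_logs);
-- Write the (date, PV) pairs to the output path.
STORE pv INTO '${OUTPUT}';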
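
The sample script stores only the daily PV. One way to also obtain the daily UV, assuming a unique visitor is identified by a distinct remote_ip, is a nested FOREACH that deduplicates the IPs within each day. The following is a sketch under that assumption, not the script published at the sample path; the aliases ips and stats and the output field names pv and uv are illustrative.

```pig
-- Same loading and parsing steps as the sample program.
lines = LOAD '${INPUT}';
fields = FOREACH lines GENERATE FLATTEN(REGEX_EXTRACT_ALL($0, '(\\S+)\\s+-\\s+\\[(.*?)\\]\\s+\\"(.*)\\"\\s+(\\d{3})\\s+(.*)\\s+\\"(.*)\\"\\s+\\"(.*)\\"\\s+(.*)\\s+\\"(.*)\\"\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)'));
access_logs = FOREACH fields GENERATE $0 AS remote_ip, ToString(ToDate($1, 'dd/MMM/yyyy:HH:mm:ss Z'), 'dd/MMM/yyyy') AS date;
groups = GROUP access_logs BY date;

-- For each day, count all records (PV) and the distinct client IPs (UV).
-- Assumption: one distinct remote_ip is treated as one unique visitor.
stats = FOREACH groups {
    ips = DISTINCT access_logs.remote_ip;
    GENERATE group AS date, COUNT(access_logs) AS pv, COUNT(ips) AS uv;
};

-- Each output line holds the date, the PV, and the UV.
STORE stats INTO '${OUTPUT}';
```

Counting distinct IPs undercounts visitors who share an address and overcounts visitors whose address changes, so treat the UV figure as an approximation.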

Create Clusters

Prepare the data. For more information, please see Data Preparation.
Prepare the Baidu AI Cloud environment.
Log in to the console, select "Product Service->Baidu MapReduce BMR", and click "Create Cluster" to open the cluster creation page. Configure the following:
- Set the cluster name.
- Set the administrator password.
- Disable logging.
- Select the image version “BMR 1.0.0(hadoop 2.7)”.
- Select the built-in template “hadoop”.
Keep the other cluster settings at their defaults and click "Finish". The new cluster then appears on the cluster list page. Creation has succeeded once the cluster status changes from "Initializing" to "Waiting".

Run Pig Steps

On the "Product Service>MapReduce>Baidu MapReduce-Step List" page, click "Create Step" to open the step creation page.
Configure the Pig step parameters as follows:
- Step type: Select “Pig step”.
- Step name: Enter a step name of no more than 255 characters.
- bos script address: You can enter the sample program path bos://bmr-public-data/apps/pig/AccessLogAnalyzer.pig.
- bos input address: You can enter the sample data path bos://bmr-public-data/logs/accesslog-1k.log.
- bos output address: Enter bos://{yourbucket}/output. You must have write permission on the path, and the directory it specifies must not already exist on BOS. For example, if the output path is bos://test/sqooptest, the sqooptest directory must not exist on BOS.
- Action after failure: Continue.
- Application parameters: None.
In the "Cluster Adaption" section, select the cluster on which to run the step.
Click "Finish" to create the step. The status changes from "Waiting" to "Running" while the step runs, and to "Completed" when it finishes.

View Results

You can view the output under the path bos://{yourbucket}/output. If you use the sample input data and sample program provided by the system, you see the following result:

03/Oct/2015    139
04/Oct/2015    375
05/Oct/2015    372
06/Oct/2015    114
