          Hive Introduction

          Weblog analysis that computes the top 5 hours of user visits is used as an example in this document to introduce how to use Hive on the Baidu AI Cloud platform. Baidu MapReduce (BMR) provides two ways to use Hive; this document describes submitting Hive steps in the console.

          Hive is a data warehouse tool built on Hadoop. It can read, write, manage, and analyze massive data sets, and its storage and computing capacities scale out easily. You can run SQL-like statements to perform MapReduce-style statistics quickly without writing a dedicated MapReduce application, while developers familiar with MapReduce can still plug in custom Mappers and Reducers for complex analyses.
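
          For example, a word count that would otherwise require a custom Mapper and Reducer reduces to one SQL-like statement. The sketch below is illustrative only; the docs table and its line column are hypothetical:

          -- Hypothetical input table: one line of text per row.
          CREATE TABLE IF NOT EXISTS docs (line STRING);

          -- Hive compiles this single statement into MapReduce jobs:
          -- the subquery splits each line into words (map side),
          -- and GROUP BY aggregates the counts (reduce side).
          SELECT word, count(1) AS cnt
          FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
          GROUP BY word;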

          Hive’s Data Units:

          • Databases: namespaces that group tables.
          • Tables: homogeneous units of data that share a schema.
          • Partitions: Only fixed partitions are supported; rows with the same partition key value are stored in the same partition.
          • Buckets (or Clusters): The data in a partition can be further subdivided; rows whose keys hash to the same value fall into the same bucket (see the sketch after this list).
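
          A minimal sketch of how partitions and buckets are declared; the table and column names are hypothetical:

          -- Each distinct dt value becomes one partition directory; within a
          -- partition, rows are hashed on user_id into 16 buckets.
          CREATE TABLE page_views (
            user_id STRING,
            url STRING
          )
          PARTITIONED BY (dt STRING)
          CLUSTERED BY (user_id) INTO 16 BUCKETS;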

          Hive’s Data Types:

          • Original data types

            • TINYINT, which takes 1 byte and stores a signed integer from -128 to 127.
            • SMALLINT, which takes 2 bytes and stores a number from -32768 to 32767.
            • INT, which takes 4 bytes and stores a number from -2147483648 to 2147483647.
            • BIGINT, which takes 8 bytes and stores a number from -2^63 to 2^63-1.
            • BOOLEAN, which is TRUE/FALSE.
            • FLOAT, which is a single-precision floating-point number.
            • DOUBLE, which is a double-precision floating-point number.
            • STRING, which is a variable-length string with no declared maximum length.
          • Complex data types

            • Structs: named fields whose values can be of arbitrary types.
            • Maps: collections of key-value pairs.
            • Arrays: ordered lists of elements of the same type (see the sketch after this list).
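
          A minimal sketch of a table using all three complex types; the table and column names are hypothetical:

          CREATE TABLE user_profiles (
            name STRING,
            address STRUCT<city:STRING, street:STRING>,  -- Structs: named fields of arbitrary types
            properties MAP<STRING, STRING>,              -- Maps: key-value pairs
            visited_urls ARRAY<STRING>                   -- Arrays: elements of one type
          );

          -- Fields are read with dot, bracket, and index syntax:
          SELECT address.city, properties['locale'], visited_urls[0]
          FROM user_profiles;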

          Submit Hive Step in Console

          Prepare Hive Script

          BMR's Hive step combines the table-creation and query statements into a single Hive script file. You can use the sample script directly or create your own Hive script based on the following content:

          DROP TABLE IF EXISTS access_logs;
          CREATE EXTERNAL TABLE access_logs (
            remote_addr STRING comment 'client IP',
            time_local STRING comment 'access time',
            request STRING comment 'request URL',
            status STRING comment 'HTTP status',
            body_bytes_sent STRING comment 'size of response body',
            http_referer STRING comment 'referer',
            http_cookie STRING comment 'cookies',
            remote_user STRING comment 'client name',
            http_user_agent STRING comment 'client browser info',
            request_time STRING comment 'consumed time of handling request',
            host STRING comment 'server host',
            msec STRING comment 'consumed time of writing logs'
          )
          COMMENT 'web access logs'
          ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
          WITH SERDEPROPERTIES (
            "input.regex" = "([0-9\.]+) - \\[([^\\]]+)\\] \"([^\"]*)\" ([\\d]+) ([\\d]*) \"([^\"]*)\" \"([^\"]*)\" ([\\S]+) \"([^\"]*)\" ([0-9\.]+) ([\\S]+) ([0-9\.]+)"
          )
          STORED AS TEXTFILE
          LOCATION "${INPUT}";
          
          INSERT OVERWRITE DIRECTORY '${OUTPUT}'
          SELECT hour(from_unixtime(unix_timestamp(time_local, 'dd/MMMM/yyyy:HH:mm:ss Z'))) as hour, count(1) as pv
          FROM access_logs
          GROUP BY hour(from_unixtime(unix_timestamp(time_local, 'dd/MMMM/yyyy:HH:mm:ss Z')))
          ORDER BY pv DESC
          LIMIT 5;

          Script Introduction

          • The CREATE EXTERNAL TABLE statement defines table metadata over the log files on BOS, and the SELECT statement counts page views (PV) along the time dimension (see the sketch after this list).
          • During table creation, the data storage path is set to ${INPUT}. This placeholder is replaced by the BOS input address you specify when submitting the Hive step to the BMR cluster. For more information, please see [Run Hive Step](#Run Hive Step).
          • You can view the query result in the step's stdout log in the BMR step list, or at the BOS output path specified when you submit the Hive step. For more information, please see [Run Hive Step](#Run Hive Step).
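
          To see how the SELECT statement derives the hour, the sketch below applies the same expression to a single hypothetical timestamp written in the log's dd/MMMM/yyyy:HH:mm:ss Z format:

          -- unix_timestamp() parses the string with the given Java date pattern,
          -- from_unixtime() converts the epoch seconds back to a timestamp string,
          -- and hour() extracts the hour of day (21 here, assuming the cluster
          -- timezone matches the log's +0800 offset).
          SELECT hour(from_unixtime(unix_timestamp('05/October/2015:21:13:02 +0800',
                                                   'dd/MMMM/yyyy:HH:mm:ss Z')));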

          Create BMR Clusters

          1. Prepare the data. For more information, please see Data Preparation.
          2. Prepare Baidu AI Cloud Environment.
          3. Log in to the console, select "Product Service->Baidu MapReduce BMR", and click "Create Cluster" to enter the cluster creation page and configure the following:

            • Set cluster name
            • Set administrator password
            • Disable log
            • Select image version “BMR 0.2.0(hadoop 2.6)”
            • Select the built-in template “hadoop”.
          4. Keep the other default cluster configurations and click "Finish". You can view the created cluster on the cluster list page; the cluster is created successfully when its status changes from "Initializing" to "Waiting".

          Run Hive Step

          1. In "Product Service>MapReduce>Baidu MapReduce-Homework List" page, click "Create Step" to enter the step creation page.
          2. Configure the Hive step parameters as follows:

            • Step type: Select “Hive step”.
            • Step name: Enter a step name of no more than 255 characters.
            • bos script address: Enter bos://bmr-public-data/apps/hive/AccessLogAnalyzer_bmr_0.2.0.hql.
            • bos input address: Enter bos://bmr-public-data/logs/.
            • bos output address: Enter bos://{your-bucket}/output. The output path must be writable, and the directory specified by the path must not already exist on BOS. For example, if the output path is bos://test/sqooptest, the sqooptest directory must not exist on bos://test.
            • Action after failure: Continue.
            • Application parameters: None.
          3. Select the cluster to run the step on in the "Cluster Adaptation" section.
          4. Click "Finish" to complete the creation of the step. The status changes from "Waiting" to "Running" when the step is running, and then it changes to "Completed" when the step is completed.

          View Results

          You can view the query result in the output file bos://{your-bucket}/output/000000. The community Hive shipped with BMR currently does not support a custom column separator when results are exported to a BOS or HDFS file, so Hive's default ^A (\001) is used as the column separator of the output file. If you use the input data and script provided by the system, the output (hour and PV per line) looks like the following; a sketch for querying it back through Hive follows the sample output:

          21    149290
          22    139479
          15    131247
          20    98391
          8     97590
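
          Since the columns are separated by \001, one way to query the result again is to map an external table over the output directory. A minimal sketch, assuming the output path used above; the table and column names are hypothetical:

          CREATE EXTERNAL TABLE hourly_pv (access_hour INT, pv BIGINT)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'  -- Hive's default separator, spelled out
          STORED AS TEXTFILE
          LOCATION 'bos://{your-bucket}/output';

          SELECT * FROM hourly_pv ORDER BY pv DESC;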