Hive Usage Guide
Hive
Hive is a data warehousing tool built on Hadoop, designed for data extraction, transformation, and loading. It provides a mechanism to store, query, and analyze large-scale data stored in Hadoop: it can map structured data files to database tables and offers SQL query capabilities, translating SQL statements into MapReduce jobs for execution.
Prerequisites
First, refer to the article BOS HDFS to install and configure BOS HDFS. The Hadoop version installed on the local machine is hadoop-3.3.2. Follow the "Getting Started" section in that article to complete the basic trial of BOS HDFS, then set the environment variables:

```shell
export HADOOP_HOME=/opt/hadoop-3.3.2
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
```
Install MySQL
MySQL serves as the storage for Hive's metadata. You can either install MySQL locally or connect remotely to an existing MySQL or RDS instance. The local version used is mysql-5.1.61-4.el6.x86_64. After installation, check that it is running with the command service mysqld status, then configure it as required.
```shell
/usr/bin/mysqladmin -u root -h ${IP} password ${new-password}  # set a new password
```
You can create a dedicated user for Hive in MySQL and set a unique password.
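As an example, a minimal sketch of creating such a user in MySQL (the `hive` user name and the password shown here are illustrative placeholders, not values mandated by this guide):

```sql
-- Illustrative: create a dedicated metastore user and grant it
-- privileges on the metastore database this guide uses (named "hive").
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'your-password';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
```

If you use a dedicated user, set javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword in hive-site.xml to match.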
Install Hive
The version installed on the local machine is 2.3.9. Create the two configuration files from their templates in the conf folder:

```shell
mv hive-env.sh.template hive-env.sh
mv hive-site.xml.template hive-site.xml
```
Add the following content to hive-env.sh:

```shell
export HIVE_CONF_DIR=/ssd2/apache-hive-2.3.9-bin/conf
```
Add the following content to hive-site.xml:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>MySQL connection URL</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>JDBC driver class</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>new-password</value>
  <description>password</description>
</property>
```
In this configuration, javax.jdo.option.ConnectionURL is the connection address of the MySQL server, javax.jdo.option.ConnectionUserName is the MySQL user name Hive connects as, and javax.jdo.option.ConnectionPassword is that user's password. After completing the settings, copy the MySQL JDBC driver into Hive's lib folder; the driver used on the local machine is mysql-connector-java-5.1.32-bin.jar.
Initialize the metastore schema in MySQL:

```shell
./bin/schematool -dbType mysql -initSchema
```
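If initialization succeeds, schematool reports completion. You can optionally verify the metastore schema afterwards (a sketch, assuming the same working directory and configuration as above):

```shell
./bin/schematool -dbType mysql -info   # prints the metastore connection URL and schema version
```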
Start Hive
```shell
./bin/hive
```
Hive testing
- Create a database and a table

```sql
create database hive;                      -- create a database
use hive;
create table hive_test (a int, b string)   -- create a table
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```
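Before loading data, you can confirm the table definition from the Hive CLI:

```sql
use hive;
show tables;         -- should list hive_test
describe hive_test;  -- columns: a int, b string
```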
- Create a new shell script named gen_data.sh

```shell
#!/bin/bash
MAXROW=1000000  # number of data rows to generate
for ((i = 0; i < $MAXROW; i++))
do
    echo "$RANDOM,\"$RANDOM\""
done
```
- Run the script to generate the test data

```shell
chmod +x gen_data.sh
./gen_data.sh > hive_test.data
```
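The load statement in the next step reads the file from BOS, so the generated file must first be uploaded to the bucket. A sketch using the BOS HDFS integration configured earlier, where ${bucket_name} is a placeholder for your bucket:

```shell
hadoop fs -put hive_test.data bos://${bucket_name}/
```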
- Load the data into the table

```sql
load data inpath "bos://${bucket_name}/hive_test.data" into table hive.hive_test;
```
- Query

```
hive> select count(*) from hive_test;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230528173013_6f5296db-562e-4342-917f-bcf14fc1480d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-05-28 17:30:16,548 Stage-1 map = 0%, reduce = 0%
2023-05-28 17:30:18,558 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local238749048_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
hive> select * from hive_test limit 10;
OK
11027 "11345"
10227 "24281"
32535 "16409"
24286 "24435"
2498 "10969"
16662 "16163"
5345 "26005"
21407 "5365"
30608 "4588"
19686 "11831"
```
