Hive Usage Guide
Hive
Hive is a data warehousing tool built on Hadoop, designed for data extraction, transformation, and loading. It provides a mechanism to store, query, and analyze large-scale datasets stored in Hadoop. Hive can map structured data files to database tables and offers SQL query functionality, converting SQL statements into MapReduce tasks for execution.
Prerequisites
First, refer to the document BOS HDFS to install and configure BOS HDFS. The Hadoop version installed locally is hadoop-3.3.2. Follow the “Getting Started” section of that document to complete a basic trial of BOS HDFS, then set the environment variables:
export HADOOP_HOME=/opt/hadoop-3.3.2
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
Install MySQL
MySQL stores Hive's metadata. You can either install MySQL locally or connect to an existing remote MySQL or RDS instance. The version used locally is mysql-5.1.61-4.el6.x86_64. After installation, check that the service is running with service mysqld status, then continue with configuration.
/usr/bin/mysqladmin -u root -h ${IP} password ${new-password}  # Set a new password
You can create a dedicated MySQL user for Hive and configure a password for added security.
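As a sketch of that step (the `hive` user name, host, and password below are placeholder assumptions, not values from this guide), a dedicated account can be created in the mysql client and granted access to the metastore database:

```sql
-- Hypothetical dedicated metastore account; adjust name, host, and password as needed
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive-password';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
```

If you use such an account, set javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword in hive-site.xml to match instead of using root.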
Install Hive
The version installed locally is Hive 2.3.9. First rename the two configuration templates in the conf folder:
mv hive-env.sh.template hive-env.sh
mv hive-site.xml.template hive-site.xml
Add the following content to hive-env.sh:
export HIVE_CONF_DIR=/ssd2/apache-hive-2.3.9-bin/conf
Add the following content to hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>MySQL connection URL</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>JDBC driver class</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>new-password</value>
  <description>password</description>
</property>
In the configuration settings, javax.jdo.option.ConnectionURL specifies the MySQL server connection URL, javax.jdo.option.ConnectionUserName is the MySQL username for Hive, and javax.jdo.option.ConnectionPassword is the corresponding password. After completing these settings, copy the MySQL JDBC driver to the lib folder. The driver version used locally is mysql-connector-java-5.1.32-bin.jar.
Initialize MySQL
./bin/schematool -dbType mysql -initSchema
Start Hive
./bin/hive
Test Hive
Create a table
create database hive;  -- create a database
use hive;
create table hive_test (a int, b string)  -- create a table
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
- Create a new shell script named gen_data.sh
#!/bin/bash
MAXROW=1000000  # number of data rows to generate
for ((i = 0; i < MAXROW; i++))
do
    echo "$RANDOM,\"$RANDOM\""
done
- Run the script to generate test data
./gen_data.sh > hive_test.data
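As a quick sanity check before generating the full million rows, the same loop can be run with a small row count (a standalone sketch; the reduced MAXROW and the /tmp output path are choices made here, not part of the guide) and verified to emit one comma-separated record per line:

```shell
#!/bin/bash
# Same loop as gen_data.sh, with MAXROW reduced for a quick local check
MAXROW=5
for ((i = 0; i < MAXROW; i++))
do
    echo "$RANDOM,\"$RANDOM\""
done > /tmp/hive_test_sample.data

wc -l < /tmp/hive_test_sample.data        # 5 lines
grep -c ',' /tmp/hive_test_sample.data    # every line contains the delimiter
```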
- Upload the generated file to BOS (for example with hadoop fs -put hive_test.data bos://${bucket_name}/), then load it into the table from within Hive
load data inpath "bos://${bucket_name}/hive_test.data" into table hive.hive_test;
- Query
hive> select count(*) from hive_test;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230528173013_6f5296db-562e-4342-917f-bcf14fc1480d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-05-28 17:30:16,548 Stage-1 map = 0%, reduce = 0%
2023-05-28 17:30:18,558 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local238749048_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
hive> select * from hive_test limit 10;
OK
11027 "11345"
10227 "24281"
32535 "16409"
24286 "24435"
2498 "10969"
16662 "16163"
5345 "26005"
21407 "5365"
30608 "4588"
19686 "11831"
