Hive Usage Guide
Hive
Hive is a data warehousing tool built on Hadoop, designed for data extraction, transformation, and loading. It provides a mechanism to store, query, and analyze large-scale datasets stored in Hadoop. Hive can map structured data files to database tables and offers SQL query functionality, converting SQL statements into MapReduce tasks for execution.
Prerequisites
First, refer to the document BOS HDFS to install and configure BOS HDFS. The Hadoop version installed locally is hadoop-3.3.2. Follow the “Getting Started” section of that document to complete a basic trial of BOS HDFS, then set the environment variables:
export HADOOP_HOME=/opt/hadoop-3.3.2
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
Install MySQL
MySQL stores Hive's metadata. You can either install MySQL locally or connect to an existing remote MySQL or RDS instance. The version used locally is mysql-5.1.61-4.el6.x86_64. After installation, check that the service is running with service mysqld status, then continue with configuration.
/usr/bin/mysqladmin -u root -h ${IP} password ${new-password}  # Set a new password
You can create a dedicated MySQL user for Hive and configure a password for added security.
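As a sketch of that step (the `hive` user name, host, and password below are placeholder assumptions, not values from this guide), a dedicated account can be created in the mysql client and granted access to the metastore database:

```sql
-- Hypothetical dedicated metastore account; adjust name, host, and password as needed
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive-password';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
```

If you use such an account, set javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword in hive-site.xml to match instead of using root.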
Install Hive
The version installed locally is Hive 2.3.9. First rename the two configuration templates in the conf folder:
mv hive-env.sh.template hive-env.sh
mv hive-site.xml.template hive-site.xml
Add the following content to hive-env.sh:
export HIVE_CONF_DIR=/ssd2/apache-hive-2.3.9-bin/conf
Add the following content to hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>MySQL connection URL</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>JDBC driver class</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>new-password</value>
  <description>password</description>
</property>
In the configuration settings, javax.jdo.option.ConnectionURL specifies the MySQL server connection URL, javax.jdo.option.ConnectionUserName is the MySQL username for Hive, and javax.jdo.option.ConnectionPassword is the corresponding password. After completing these settings, copy the MySQL JDBC driver to the lib folder. The driver version used locally is mysql-connector-java-5.1.32-bin.jar.
Initialize MySQL
./bin/schematool -dbType mysql -initSchema
Start Hive
./bin/hive
Test Hive
Create a table
create database hive;  -- create a database
use hive;
create table hive_test (a int, b string)  -- create a table
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
- Create a new shell script named gen_data.sh
#!/bin/bash
MAXROW=1000000  # number of data rows to generate
for ((i = 0; i < MAXROW; i++))
do
    echo "$RANDOM,\"$RANDOM\""
done
- Run the script to generate test data
./gen_data.sh > hive_test.data
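As a quick sanity check before generating the full million rows, the same loop can be run with a small row count (a standalone sketch; the reduced MAXROW and the /tmp output path are choices made here, not part of the guide) and verified to emit one comma-separated record per line:

```shell
#!/bin/bash
# Same loop as gen_data.sh, with MAXROW reduced for a quick local check
MAXROW=5
for ((i = 0; i < MAXROW; i++))
do
    echo "$RANDOM,\"$RANDOM\""
done > /tmp/hive_test_sample.data

wc -l < /tmp/hive_test_sample.data        # 5 lines
grep -c ',' /tmp/hive_test_sample.data    # every line contains the delimiter
```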
- Upload the generated file to BOS (for example with hadoop fs -put hive_test.data bos://${bucket_name}/), then load it into the table from within Hive
load data inpath "bos://${bucket_name}/hive_test.data" into table hive.hive_test;
- Query
hive> select count(*) from hive_test;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230528173013_6f5296db-562e-4342-917f-bcf14fc1480d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-05-28 17:30:16,548 Stage-1 map = 0%, reduce = 0%
2023-05-28 17:30:18,558 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local238749048_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
hive> select * from hive_test limit 10;
OK
11027 "11345"
10227 "24281"
32535 "16409"
24286 "24435"
2498 "10969"
16662 "16163"
5345 "26005"
21407 "5365"
30608 "4588"
19686 "11831"
