Impala Practice Based on BOS

Updated at: 2025-11-03

Impala

Impala is an MPP (Massively Parallel Processing) SQL query engine for large datasets stored in Hadoop clusters. It is open-source software written in C++ and Java. Compared with other SQL engines for Hadoop, Impala is designed to deliver higher performance and lower query latency.

Installation steps

Install metastore

To install and configure the metastore, follow the “Presto Access Based on S3” section of the article Presto Practices.

Install impala

Download the CDH 5.14.0 repository tarball, which contains the Impala RPM packages, from http://archive.cloudera.com/cdh5/repo-as-tarball/5.14.0/cdh5.14.0-centos6.tar.gz and extract it:

Plain Text

tar -zxvf cdh5.14.0-centos6.tar.gz
cd cdh/5.14.0

From this directory, start a local HTTP server in the background to serve the packages: python -m SimpleHTTPServer 8092 &
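SimpleHTTPServer is the Python 2 module name; on Python 3 the equivalent module is http.server. As a minimal sketch, assuming Python 3.7 or later is available on the node, the same local server can be built as follows. The helper name make_repo_server is hypothetical, and binding to 127.0.0.1 only is a choice made here because the repository is consumed on the same machine.

```python
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler


def make_repo_server(directory, port=8092, host="127.0.0.1"):
    """Return an HTTP server that serves `directory` (not started yet)."""
    # The `directory` keyword of SimpleHTTPRequestHandler needs Python 3.7+.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return HTTPServer((host, port), handler)
```

Calling make_repo_server("cdh/5.14.0").serve_forever() then serves the extracted packages on port 8092 until the process is interrupted.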

Configure the local YUM repository

Create the file /etc/yum.repos.d/localimp.repo (for example, with vim /etc/yum.repos.d/localimp.repo) and enter the following content:

Plain Text

[localimp]
name=localimp
baseurl=http://127.0.0.1:8092/
gpgcheck=0
enabled=1
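Before running yum, it can help to confirm that the baseurl actually answers. The sketch below assumes the served directory follows the standard YUM layout, with its metadata index at repodata/repomd.xml; the function names are hypothetical.

```python
import urllib.error
import urllib.request


def repomd_url(baseurl):
    """Return the URL of the repository metadata index under `baseurl`."""
    return baseurl.rstrip("/") + "/repodata/repomd.xml"


def repo_reachable(baseurl, timeout=3):
    """Return True if the repository metadata can be fetched over HTTP."""
    try:
        with urllib.request.urlopen(repomd_url(baseurl), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error such as 404.
        return False


if __name__ == "__main__":
    print(repo_reachable("http://127.0.0.1:8092/"))
```

A result of False usually means the local server is not running or was started from a different directory.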

Install Impala with the following command:

Plain Text

yum install -y impala impala-server impala-state-store impala-catalog impala-shell

Then copy the Hive configuration file (metastore-site.xml) to the Impala configuration path, renaming it to hive-site.xml:

Plain Text

cp metastore/conf/metastore-site.xml /etc/impala/conf/hive-site.xml

Add S3 configuration

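Edit /etc/impala/conf/core-site.xml (for example, with vim /etc/impala/conf/core-site.xml) and add the following properties, replacing ${bucket}, ${AK}, and ${SK} with your bucket name, access key, and secret key. For details, refer to Impala-S3 Configuration.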

Plain Text

<configuration>
    <property>
        <name>fs.s3a.block.size</name>
        <value>134217728</value>
    </property>
    <property>
        <name>fs.azure.user.agent.prefix</name>
        <value>User-Agent: APN/1.0 Hortonworks/1.0 HDP/None</value>
    </property>
    <property>
        <name>fs.s3a.connection.maximum</name>
        <value>1500</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>s3a://${bucket}</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.bj.bcebos.com</value>
        <description>endpoint</description>
    </property>
    <property>
        <name>fs.s3a.access.key</name>
        <value>${AK}</value>
        <description>AK</description>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>${SK}</value>
        <description>SK</description>
    </property>
</configuration>
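A common mistake at this step is leaving a template placeholder such as ${AK} in the file, or breaking the XML while editing. The following is a minimal sketch of a sanity check, not part of Impala itself: the function names and the list of required keys are assumptions chosen for this guide. It treats every ${...} as an unfilled placeholder, although Hadoop also supports variable substitution with that syntax.

```python
import re
import xml.etree.ElementTree as ET

# Keys this guide relies on for BOS access through the S3A connector.
REQUIRED = ("fs.s3a.endpoint", "fs.s3a.access.key", "fs.s3a.secret.key")
PLACEHOLDER = re.compile(r"\$\{[^}]+\}")


def read_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed XML
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def find_problems(props):
    """Return human-readable problems found in the parsed properties."""
    problems = []
    for key in REQUIRED:
        if not props.get(key):
            problems.append("missing: " + key)
    for key, value in sorted(props.items()):
        if PLACEHOLDER.search(value):
            problems.append("unresolved placeholder in: " + key)
    return problems


if __name__ == "__main__":
    try:
        with open("/etc/impala/conf/core-site.xml") as f:
            print(find_problems(read_properties(f.read())) or "no problems found")
    except OSError as exc:
        print("cannot read core-site.xml:", exc)
```

An empty problem list only means the file is well-formed and filled in; it does not verify that the keys are valid for the bucket.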

Modify the Bigtop configuration: set JAVA_HOME and make sure the Impala user has permission to access the JDK directory.

Edit /etc/default/bigtop-utils (for example, with vim /etc/default/bigtop-utils) and add the following line, adjusting the path to your JDK installation:

Plain Text

export JAVA_HOME=/export/servers/jdk1.8.0_65

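Create a symbolic link for the MySQL JDBC driver. A relative link target is resolved against the link's own directory, so run this with the JAR located in /usr/share/java, or pass the JAR's absolute path instead: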

Plain Text

ln -s mysql-connector-java-5.1.32.jar /usr/share/java/mysql-connector-java.jar
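Because ln -s stores the target path exactly as typed, a link created from the wrong directory ends up dangling without any error message. A small sketch for checking the result follows; link_status is a hypothetical helper name.

```python
import os


def link_status(link_path):
    """Describe whether `link_path` is a symlink that resolves to a file."""
    if not os.path.islink(link_path):
        return "not a symlink"
    if not os.path.exists(link_path):  # exists() follows the link
        return "broken -> " + os.readlink(link_path)
    return "ok -> " + os.path.realpath(link_path)


if __name__ == "__main__":
    print(link_status("/usr/share/java/mysql-connector-java.jar"))
```

A "broken" result means the stored target does not exist relative to /usr/share/java, and the link should be recreated with an absolute path.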

Start Impala

Plain Text

service impala-state-store start
service impala-catalog start
service impala-server start
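One quick way to see whether the three daemons came up is to probe their listening ports. The sketch below assumes stock port settings: 21000 and 25000 appear in the impala-shell session in this guide, while 25010 and 25020 are the usual default web UI ports of statestored and catalogd and may differ on a customized installation.

```python
import socket

# Assumed default ports; adjust if the daemons were started with other flags.
DEFAULT_PORTS = {
    "impalad (impala-shell)": 21000,
    "impalad web UI": 25000,
    "statestored web UI": 25010,
    "catalogd web UI": 25020,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in DEFAULT_PORTS.items():
        state = "open" if port_open("127.0.0.1", port) else "closed"
        print(f"{name:24} {port:5d} {state}")
```

An open port shows only that a process is listening; the daemon may still be initializing, so the logs remain the authoritative source.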

If a service fails to start, check the logs in the /var/log/impala directory to find the cause.
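Impala writes its logs in glog format, where each entry begins with a severity letter (I, W, E, or F) followed by the date and time. As a rough sketch, the error and fatal entries can be filtered out like this; the *.ERROR file pattern is an assumption about the log file naming, and severe_lines is a hypothetical helper.

```python
import glob
import re

# glog prefix: severity letter, MMDD, then HH:MM:SS.microseconds
GLOG_SEVERE = re.compile(r"^[EF]\d{4} \d{2}:\d{2}:\d{2}\.\d{6}")


def severe_lines(lines):
    """Return the ERROR and FATAL entries from an iterable of log lines."""
    return [line.rstrip("\n") for line in lines if GLOG_SEVERE.match(line)]


if __name__ == "__main__":
    for path in sorted(glob.glob("/var/log/impala/*.ERROR")):
        with open(path, errors="replace") as f:
            for line in severe_lines(f):
                print(path + ": " + line)
```

Continuation lines such as stack traces do not carry the prefix and are skipped, so open the file itself once an entry of interest is found.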

Once it starts normally, run the impala-shell command:

Plain Text

[root@my-node impala]# impala-shell
Starting Impala Shell without Kerberos authentication
Connected to my-node:21000
Server version: impalad version 2.11.0-cdh5.14.0 RELEASE (build d68206561bce6b26762d62c01a78e6cd27aa7690)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v2.11.0-cdh5.14.0 (d682065) built on Sat Jan  6 13:27:16 PST 2018)
When pretty-printing is disabled, you can use the '--output_delimiter' flag to set
the delimiter for fields in the same row. The default is ','.
***********************************************************************************
[my-node:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
+------------------+----------------------------------------------+
Fetched 2 row(s) in 0.16s
[my-node:21000] > CREATE DATABASE db_on_s3 LOCATION 's3a://my-bigdata/impala/s3';
Query: create DATABASE db_on_s3 LOCATION 's3a://my-bigdata/impala/s3'
WARNINGS: Path 's3a://my-bigdata/impala' cannot be reached: Path does not exist.
Fetched 0 row(s) in 2.51s
[my-node:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| db_on_s3         |                                              |
| default          | Default Hive database                        |
+------------------+----------------------------------------------+
Fetched 3 row(s) in 0.01s
[my-node:21000] > use db_on_s3;
Query: use db_on_s3
[my-node:21000] > create table hive_test (a int, b string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Query: create table hive_test (a int, b string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Fetched 0 row(s) in 2.11s
[my-node:21000] > insert into hive_test(a, b) values(1,'tom');
Query: insert into hive_test(a, b) values(1,'tom')
Query submitted at: 2023-09-13 19:20:26 (Coordinator: http://my-node:25000)
Query progress can be monitored at: http://my-node:25000/query_plan?query_id=ec4463f20d37dfe4:5192e94f00000000
Modified 1 row(s) in 7.57s
[my-node:21000] > insert into hive_test(a, b) values(2,'jerry');
Query: insert into hive_test(a, b) values(2,'jerry')
Query submitted at: 2023-09-13 19:20:42 (Coordinator: http://my-node:25000)
Query progress can be monitored at: http://my-node:25000/query_plan?query_id=694061adf492a154:4a24912d00000000
Modified 1 row(s) in 1.02s
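The same table can also be read back without an interactive session. The sketch below wraps impala-shell with its -i (impalad address), -B (delimited output), --output_delimiter, and -q (query) options. The helper names are hypothetical, and the host my-node:21000 is taken from the session above.

```python
import shutil
import subprocess


def build_command(query, impalad="my-node:21000", delimiter=","):
    """Build the impala-shell argument list for one delimited-output query."""
    return ["impala-shell", "-i", impalad, "-B",
            "--output_delimiter=" + delimiter, "-q", query]


def parse_rows(stdout_text, delimiter=","):
    """Split delimited impala-shell output into tuples of strings."""
    # Naive split: values that contain the delimiter are not handled.
    return [tuple(line.split(delimiter))
            for line in stdout_text.splitlines() if line]


if __name__ == "__main__":
    if shutil.which("impala-shell"):  # run only where the shell is installed
        cmd = build_command("select a, b from db_on_s3.hive_test order by a")
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                universal_newlines=True, check=True)
        print(parse_rows(result.stdout))
```

Only standard output is captured here, so any status messages the shell writes to standard error still appear on the terminal.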

You can see the newly generated files in the corresponding path:

[Screenshot: image (1).png]
