Impala Practice Based on BOS
Impala
Impala is an MPP (Massively Parallel Processing) SQL query engine for large datasets stored in Hadoop clusters. It is open-source software written in C++ and Java. It is designed for interactive queries and typically delivers lower query latency than batch-oriented SQL engines for Hadoop.
Installation steps
Install metastore
To install and configure the metastore, refer to the “Presto Access Based on S3” section of the article Presto Practices.
Install Impala
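- Download the RPM repository tarball (http://archive.cloudera.com/cdh5/repo-as-tarball/5.14.0/cdh5.14.0-centos6.tar.gz), extract it, and change into the extracted directory:
tar -zxvf cdh5.14.0-centos6.tar.gz
cd cdh/5.14.0
Then start a local HTTP server in that directory so that YUM can fetch the packages from it. SimpleHTTPServer is the Python 2 module; under Python 3 the equivalent is python3 -m http.server 8092.
python -m SimpleHTTPServer 8092 &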
- Configure the local YUM repository
Create /etc/yum.repos.d/localimp.repo (for example with vim /etc/yum.repos.d/localimp.repo) and enter:
[localimp]
name=localimp
baseurl=http://127.0.0.1:8092/
gpgcheck=0
enabled=1
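If the repository file needs to be generated by a script rather than typed by hand, the following minimal sketch renders the same content with Python's standard configparser module. The function name render_repo is hypothetical and only illustrates the file layout shown above; writing the result to /etc/yum.repos.d/ is left to the caller.

```python
import configparser
import io


def render_repo(repo_id: str, baseurl: str) -> str:
    """Return the text of a yum .repo file for a local repository.

    render_repo is a hypothetical helper, not part of yum or Impala.
    """
    parser = configparser.ConfigParser()
    # Keep option names exactly as written (configparser lowercases them by default).
    parser.optionxform = str
    parser[repo_id] = {
        "name": repo_id,
        "baseurl": baseurl,
        "gpgcheck": "0",  # signature checking is disabled, as in the file above
        "enabled": "1",
    }
    buf = io.StringIO()
    # Write key=value without spaces to match the style of the file above.
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_repo("localimp", "http://127.0.0.1:8092/"))
```

The baseurl must point at the port used by the local HTTP server (8092 in this article).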
- Install Impala with the following command:
yum install -y impala impala-server impala-state-store impala-catalog impala-shell
- Copy the metastore configuration file (metastore-site.xml) into the Impala configuration directory as hive-site.xml:
cp metastore/conf/metastore-site.xml /etc/impala/conf/hive-site.xml
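- Add the S3 configuration
Edit /etc/impala/conf/core-site.xml (for example with vim /etc/impala/conf/core-site.xml); for details, refer to Impala-S3 Configuration. Replace ${bucket}, ${AK}, and ${SK} with your bucket name, access key, and secret key: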
<configuration>
  <property>
    <name>fs.s3a.block.size</name>
    <value>134217728</value>
  </property>
  <property>
    <name>fs.azure.user.agent.prefix</name>
    <value>User-Agent: APN/1.0 Hortonworks/1.0 HDP/None</value>
  </property>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>1500</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://${bucket}</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.bj.bcebos.com</value>
    <description>endpoint</description>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>${AK}</value>
    <description>AK</description>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>${SK}</value>
    <description>SK</description>
  </property>
</configuration>
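Because the bucket, AK, and SK values have to be substituted before the file is usable, it can help to generate core-site.xml from a script. The sketch below is one possible way to do that with Python's standard xml.etree.ElementTree module. The function name build_core_site is hypothetical, and the sketch covers only fs.defaultFS and the fs.s3a.* properties from the configuration above; the output is not indented, which is assumed to be acceptable since indentation carries no meaning in this XML.

```python
import xml.etree.ElementTree as ET


def build_core_site(bucket: str, endpoint: str, access_key: str, secret_key: str) -> str:
    """Render a core-site.xml string with the S3A settings used in this article.

    build_core_site is a hypothetical helper, not part of Hadoop or Impala.
    """
    properties = [
        ("fs.s3a.block.size", str(128 * 1024 * 1024)),  # 134217728 bytes = 128 MiB
        ("fs.s3a.connection.maximum", "1500"),
        ("fs.defaultFS", f"s3a://{bucket}"),
        ("fs.s3a.endpoint", endpoint),
        ("fs.s3a.access.key", access_key),
        ("fs.s3a.secret.key", secret_key),
    ]
    root = ET.Element("configuration")
    for name, value in properties:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ElementTree escapes XML special characters in the values automatically.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder credentials; substitute real values before use.
    print(build_core_site("my-bigdata", "s3.bj.bcebos.com", "YOUR_AK", "YOUR_SK"))
```

Since the generated file contains the secret key in plain text, it is sensible to restrict its file permissions to the user that runs Impala.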
- Modify the Bigtop configuration: set JAVA_HOME and make sure the impala user has permission to access it
Open /etc/default/bigtop-utils (for example with vim /etc/default/bigtop-utils) and add:
export JAVA_HOME=/export/servers/jdk1.8.0_65
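- Create a symbolic link for the MySQL JDBC driver. Run the command from the directory that holds the JAR's link target as seen from /usr/share/java, or use the JAR's absolute path, because a relative target is resolved relative to the directory containing the link:
ln -s mysql-connector-java-5.1.32.jar /usr/share/java/mysql-connector-java.jar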
- Start Impala:
service impala-state-store start
service impala-catalog start
service impala-server start
If a service fails to start, check the logs in the /var/log/impala directory to find the cause.
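When the log files are long, filtering for error entries is a quick first step. The sketch below assumes the Impala logs use the glog line layout, in which each entry starts with a severity letter (I, W, E, or F) followed by the date and time; the function name error_lines and the sample lines are hypothetical and are not taken from a real run.

```python
import re
from typing import Iterable, List

# Assumed glog prefix: severity letter, MMDD, then HH:MM:SS.microseconds.
GLOG_PREFIX = re.compile(r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s")


def error_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines whose severity is E (error) or F (fatal)."""
    hits = []
    for line in lines:
        match = GLOG_PREFIX.match(line)
        if match and match.group(1) in ("E", "F"):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines in the assumed layout.
    sample = [
        "I0913 19:20:26.000000 1234 example.cc:1] service started\n",
        "E0913 19:20:27.000000 1234 example.cc:2] could not reach metastore\n",
    ]
    for line in error_lines(sample):
        print(line)
```

To scan a real file, pass an open file object, for example error_lines(open(path)) for a log file under /var/log/impala.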
- Once it starts normally, run the impala-shell command:
[root@my-node impala]# impala-shell
Starting Impala Shell without Kerberos authentication
Connected to my-node:21000
Server version: impalad version 2.11.0-cdh5.14.0 RELEASE (build d68206561bce6b26762d62c01a78e6cd27aa7690)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v2.11.0-cdh5.14.0 (d682065) built on Sat Jan 6 13:27:16 PST 2018)
When pretty-printing is disabled, you can use the '--output_delimiter' flag to set
the delimiter for fields in the same row. The default is ','.
***********************************************************************************
[my-node:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
+------------------+----------------------------------------------+
Fetched 2 row(s) in 0.16s
[my-node:21000] > CREATE DATABASE db_on_s3 LOCATION 's3a://my-bigdata/impala/s3';
Query: create DATABASE db_on_s3 LOCATION 's3a://my-bigdata/impala/s3'
WARNINGS: Path 's3a://my-bigdata/impala' cannot be reached: Path does not exist.
Fetched 0 row(s) in 2.51s
[my-node:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| db_on_s3         |                                              |
| default          | Default Hive database                        |
+------------------+----------------------------------------------+
Fetched 3 row(s) in 0.01s
[my-node:21000] > use db_on_s3;
Query: use db_on_s3
[my-node:21000] > create table hive_test (a int, b string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Query: create table hive_test (a int, b string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Fetched 0 row(s) in 2.11s
[my-node:21000] > insert into hive_test(a, b) values(1,'tom');
Query: insert into hive_test(a, b) values(1,'tom')
Query submitted at: 2023-09-13 19:20:26 (Coordinator: http://my-node:25000)
Query progress can be monitored at: http://my-node:25000/query_plan?query_id=ec4463f20d37dfe4:5192e94f00000000
Modified 1 row(s) in 7.57s
[my-node:21000] > insert into hive_test(a, b) values(2,'jerry');
Query: insert into hive_test(a, b) values(2,'jerry')
Query submitted at: 2023-09-13 19:20:42 (Coordinator: http://my-node:25000)
Query progress can be monitored at: http://my-node:25000/query_plan?query_id=694061adf492a154:4a24912d00000000
Modified 1 row(s) in 1.02s
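The same statements can also be run non-interactively, which is convenient for checking the setup from a script. The sketch below builds an impala-shell command line using the -i, -B, -q, and --output_delimiter options and parses the delimited output. The function names build_cmd, parse_rows, and run_query are hypothetical; running run_query assumes impala-shell is on the PATH and that an impalad is reachable at the given address. The parser splits on every comma, so it is not suitable for string values that themselves contain commas.

```python
import subprocess
from typing import List


def build_cmd(query: str, impalad: str = "my-node:21000") -> List[str]:
    """Build an impala-shell command that prints comma-delimited rows."""
    return [
        "impala-shell",
        "-i", impalad,           # impalad address, as in the session above
        "-B",                    # delimited output instead of the pretty-printed table
        "--output_delimiter=,",
        "-q", query,
    ]


def parse_rows(stdout: str) -> List[List[str]]:
    """Split delimited impala-shell output into rows of string fields."""
    return [line.split(",") for line in stdout.splitlines() if line.strip()]


def run_query(query: str, impalad: str = "my-node:21000") -> List[List[str]]:
    """Run the query through impala-shell and return the parsed rows."""
    result = subprocess.run(
        build_cmd(query, impalad),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if impala-shell exits with an error
    )
    return parse_rows(result.stdout)
```

For example, run_query("select a, b from db_on_s3.hive_test") would query the table created in the session above.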
The newly generated data files can then be seen under the corresponding path in the bucket.

