Presto Practice Based on BOS
Overview
Presto is a data query engine developed by Facebook that enables fast, interactive analysis of massive datasets and supports various data sources like Hive and relational databases. With BOS offering ultra-low cost, high performance, excellent reliability, and high throughput, more enterprises are opting for BOS as their big data storage solution. This article will provide a brief introduction to using Presto on BOS.
Use Presto based on BOS HDFS
BOS HDFS
First, refer to the article BOS HDFS to install and configure BOS HDFS; the Hadoop version used on the local machine is hadoop-3.3.2. Follow the "Getting Started" section of that article to complete a basic trial of BOS HDFS, then set the environment variables:
export HADOOP_HOME=/opt/hadoop-3.3.2
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
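Before moving on, it helps to confirm that BOS HDFS is reachable. A minimal sanity check, assuming ${bucket_name} is a placeholder for your BOS bucket:
hadoop fs -ls bos://${bucket_name}/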
Hive
Hive is a data warehousing tool built on Hadoop, designed for data extraction, transformation, and loading. It provides a mechanism to store, query, and analyze large-scale data stored in Hadoop. Hive can map structured data files to database tables and offers SQL query capabilities, transforming SQL statements into MapReduce tasks for execution.
Install MySQL
MySQL is used to store Hive’s metadata. You can choose to install it locally, or directly connect remotely to an already installed MySQL or RDS. The version installed on the local machine is: mysql-5.1.61-4.el6.x86_64
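For a local installation on CentOS 6, for example, a minimal sketch (package names vary by distribution):
yum install -y mysql-server
service mysqld start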
After installation, check the running status with service mysqld status, and set a new password:
/usr/bin/mysqladmin -u root -h ${IP} password ${new-password} # Set a new password
You can create a dedicated user for Hive in MySQL and set a unique password.
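A minimal sketch of creating such a user, assuming the name hive and a placeholder password; if you use it, update javax.jdo.option.ConnectionUserName and ConnectionPassword in the hive-site.xml below accordingly:
mysql -u root -p <<'EOF'
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive_password';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
EOF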
Install Hive
The version installed on the local machine is 2.3.9.
- Rename the two configuration templates in the conf folder:
mv hive-env.sh.template hive-env.sh
mv hive-site.xml.template hive-site.xml
Add the following content to hive-env.sh:
export HIVE_CONF_DIR=/ssd2/apache-hive-2.3.9-bin/conf
Add the following content to hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>MySQL connection URL</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>JDBC driver</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>new-password</value>
  <description>password</description>
</property>
In the configuration, javax.jdo.option.ConnectionURL specifies the MySQL server connection address, javax.jdo.option.ConnectionUserName is the MySQL username used by Hive, and javax.jdo.option.ConnectionPassword is the corresponding password. Once configured, copy the MySQL JDBC driver into the lib folder. The driver used on the local system is mysql-connector-java-5.1.32-bin.jar.
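For example, assuming Hive is installed under /ssd2/apache-hive-2.3.9-bin as in hive-env.sh above, and the driver jar sits in the current directory:
cp mysql-connector-java-5.1.32-bin.jar /ssd2/apache-hive-2.3.9-bin/lib/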
- Initialize MySQL
./bin/schematool -dbType mysql -initSchema
- Start Hive
./bin/hive
- Hive testing
- Create a table
-- Create a database and switch to it
create database hive;
use hive;
-- Create a table
create table hive_test (a int, b string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
- Create a new shell script named gen_data.sh
#!/bin/bash
MAXROW=1000000 # Specify the number of data rows to generate
for ((i = 0; i < $MAXROW; i++))
do
    echo $RANDOM, \"$RANDOM\"
done
- Run the script to generate test data
./gen_data.sh > hive_test.data
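- Upload the generated data to BOS. The load command below reads from a bos:// path, so the file must be on BOS first; a minimal sketch, assuming the BOS HDFS setup above and ${bucket_name} as a placeholder for your bucket:
hadoop fs -put hive_test.data bos://${bucket_name}/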
- Load the data into the table
load data inpath "bos://${bucket_name}/hive_test.data" into table hive.hive_test;
- Query
hive> select count(*) from hive_test;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230528173013_6f5296db-562e-4342-917f-bcf14fc1480d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-05-28 17:30:16,548 Stage-1 map = 0%, reduce = 0%
2023-05-28 17:30:18,558 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local238749048_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
hive> select * from hive_test limit 10;
OK
11027   "11345"
10227   "24281"
32535   "16409"
24286   "24435"
2498    "10969"
16662   "16163"
5345    "26005"
21407   "5365"
30608   "4588"
19686   "11831"
Presto
The installed version is 349; for the installation steps, refer to the article Presto Installation. If the local machine also serves as a worker, the following must be set in config.properties:
# must be set to true so that the coordinator node also schedules work on itself
node-scheduler.include-coordinator=true
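For reference, a complete single-node config.properties might look like the following; these values are illustrative rather than prescriptive, and the port matches the CLI invocation later in this article:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8881
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8881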
In etc/catalog/hive.properties, configure as follows:
connector.name=hive-hadoop2
# the paths here must be correct
hive.config.resources=/ssd2/hadoop-3.3.2/etc/hadoop/core-site.xml,/ssd2/hadoop-3.3.2/etc/hadoop/hdfs-site.xml
hive.metastore.uri=thrift://127.0.0.1:9083
hive.allow-drop-table=false
hive.storage-format=ORC
hive.metastore-cache-ttl=1s
hive.metastore-refresh-interval=1s
hive.metastore-timeout=35m
hive.max-partitions-per-writers=1000
hive.cache.enabled=true
hive.cache.location=/opt/hive-cache
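Note that hive.metastore.uri above assumes a Hive metastore thrift service listening on port 9083. If it is not already running, it can be started from the Hive installation directory (a hedged step; adjust the path to your setup):
./bin/hive --service metastore &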
Copy the BOS filesystem jar package into plugin/hive-hadoop2/, then start the presto-server:
./bin/launcher start
Once the server is running, connect with the Presto CLI:
./presto-cli --server localhost:8881 --catalog hive --schema default
Run a query:
presto:default> use hive;
USE
presto:hive> select * from hive_test limit 10;
   a   |    b
-------+----------
 11027 | "11345"
 10227 | "24281"
 32535 | "16409"
 24286 | "24435"
  2498 | "10969"
 16662 | "16163"
  5345 | "26005"
 21407 | "5365"
 30608 | "4588"
 19686 | "11831"
(10 rows)
Query 20230601_084130_00004_dzvjb, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
[Latency: client-side: 0:02, server-side: 0:02] [59.4K rows, 831KB] [28.9K rows/s, 404KB/s]
In the example above, we use Presto to query data stored in BOS.
Presto access based on S3
Presto can access data stored in BOS only through Hive, and Hive can interact with BOS in two ways: the first is the BOS HDFS method described above; the second is connecting directly to BOS using the S3 protocol.
Hive configuration
Refer to the steps in Installing Metastore:
wget "https://repo1.maven.org/maven2/org/apache/hive/hive-standalone-metastore/3.1.2/hive-standalone-metastore-3.1.2-bin.tar.gz"
tar -zxvf hive-standalone-metastore-3.1.2-bin.tar.gz
sudo mv apache-hive-metastore-3.1.2-bin /usr/local/metastore
sudo chown user:user /usr/local/metastore
After downloading hive-standalone-metastore, replace the bundled Guava jar and add the necessary S3A jar packages:
rm /usr/local/metastore/lib/guava-19.0.jar
cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar \
   /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-3.2.1.jar \
   /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar \
   /usr/local/metastore/lib/
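The hadoop-aws and aws-java-sdk-bundle versions must match the ones shipped with your Hadoop distribution; a quick way to check what is available (a hedged sketch, assuming Hadoop under /usr/local/hadoop):
ls /usr/local/hadoop/share/hadoop/tools/lib/ | grep -E 'hadoop-aws|aws-java-sdk'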
Then configure /usr/local/metastore/conf/metastore-site.xml.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>
<property>
  <name>hive.metastore.event.db.notification.api.auth</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>S3_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>S3_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>S3_ENDPOINT</value>
</property>
fs.s3a.endpoint refers to the S3 endpoint; the S3 endpoint of BOS can be found in BOS S3 Domain Name.
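Before the first start, the MySQL JDBC driver must be present in the metastore's lib directory and the metastore schema needs to be initialized; a minimal sketch, reusing the driver jar from earlier (paths are illustrative):
cp mysql-connector-java-5.1.32-bin.jar /usr/local/metastore/lib/
/usr/local/metastore/bin/schematool -initSchema -dbType mysql
Then start the Hive metastore: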
/usr/local/metastore/bin/start-metastore &
Presto configuration
The version of Presto remains unchanged here, and the content of hive.properties is changed to:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.path-style-access=true
hive.s3.endpoint=S3_ENDPOINT
hive.s3.aws-access-key=S3_ACCESS_KEY
hive.s3.aws-secret-key=S3_SECRET_KEY
hive.s3.ssl.enabled=false
Start Presto
/usr/local/trino/bin/launcher start
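As in the BOS HDFS setup, connect with the CLI once the server is up (assuming the same server address as before); the SQL below is run from this session:
./presto-cli --server localhost:8881 --catalog hive --schema default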
Create a schema:
CREATE SCHEMA IF NOT EXISTS hive.iris
WITH (location = 's3a://my-bos-bucket/');
-- Note: the location must start with s3a://. If the bucket is my-bos-bucket,
-- the location should be 's3a://my-bos-bucket/'.
Create a table and query it:
CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
    sepal_length DOUBLE,
    sepal_width DOUBLE,
    petal_length DOUBLE,
    petal_width DOUBLE,
    class VARCHAR
)
WITH (
    external_location = 's3a://my-bos-bucket/iris_parquet',
    format = 'PARQUET'
);
SELECT
    sepal_length,
    class
FROM hive.iris.iris_parquet
LIMIT 10;
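Note that the CREATE TABLE above only registers metadata; it assumes Parquet files already exist under s3a://my-bos-bucket/iris_parquet. A hedged sketch for uploading a local Parquet file with the Hadoop S3A client (the file name and credential placeholders are illustrative):
hadoop fs -D fs.s3a.endpoint=${S3_ENDPOINT} \
          -D fs.s3a.access.key=${S3_ACCESS_KEY} \
          -D fs.s3a.secret.key=${S3_SECRET_KEY} \
          -D fs.s3a.path.style.access=true \
          -put iris.parquet s3a://my-bos-bucket/iris_parquet/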
