Presto Practice Based on BOS
Overview
Presto is a data query engine developed by Facebook that enables fast, interactive analysis of massive datasets and supports various data sources like Hive and relational databases. With BOS offering ultra-low cost, high performance, excellent reliability, and high throughput, more enterprises are opting for BOS as their big data storage solution. This article will provide a brief introduction to using Presto on BOS.
Use Presto based on BOS HDFS
BOS HDFS
First, refer to the article BOS HDFS to install and configure BOS HDFS; the Hadoop version used on the local machine is hadoop-3.3.2. Follow the "Getting Started" section of that article to complete a basic trial of BOS HDFS, then set the environment variables:
export HADOOP_HOME=/opt/hadoop-3.3.2
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
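Before moving on, it helps to confirm that BOS HDFS is reachable. A minimal sanity check, assuming ${bucket_name} is a placeholder for your BOS bucket:
hadoop fs -ls bos://${bucket_name}/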
Hive
Hive is a data warehousing tool built on Hadoop, designed for data extraction, transformation, and loading. It provides a mechanism to store, query, and analyze large-scale data stored in Hadoop. Hive can map structured data files to database tables and offers SQL query capabilities, transforming SQL statements into MapReduce tasks for execution.
Install MySQL
MySQL is used to store Hive’s metadata. You can choose to install it locally, or directly connect remotely to an already installed MySQL or RDS. The version installed on the local machine is: mysql-5.1.61-4.el6.x86_64
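For a local installation on CentOS 6, for example, a minimal sketch (package names vary by distribution):
yum install -y mysql-server
service mysqld start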
After installation, check the running status with service mysqld status, and set a new password:
/usr/bin/mysqladmin -u root -h ${IP} password ${new-password} # Set a new password
You can create a dedicated user for Hive in MySQL and set a unique password.
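A minimal sketch of creating such a user, assuming the name hive and a placeholder password; if you use it, update javax.jdo.option.ConnectionUserName and ConnectionPassword in the hive-site.xml below accordingly:
mysql -u root -p <<'EOF'
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive_password';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
EOF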
Install Hive
The version installed on the local machine is 2.3.9.
- Rename the two configuration templates in the conf folder:
mv hive-env.sh.template hive-env.sh
mv hive-site.xml.template hive-site.xml
Add the following content to hive-env.sh:
export HIVE_CONF_DIR=/ssd2/apache-hive-2.3.9-bin/conf
Add the following content to hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>MySQL connection URL</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>JDBC driver</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>new-password</value>
  <description>password</description>
</property>
In the configuration, javax.jdo.option.ConnectionURL specifies the MySQL server connection address, javax.jdo.option.ConnectionUserName is the MySQL username used by Hive, and javax.jdo.option.ConnectionPassword is the corresponding password. Once configured, copy the MySQL JDBC driver into the lib folder. The driver used on the local system is mysql-connector-java-5.1.32-bin.jar.
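For example, assuming Hive is installed under /ssd2/apache-hive-2.3.9-bin as in hive-env.sh above, and the driver jar sits in the current directory:
cp mysql-connector-java-5.1.32-bin.jar /ssd2/apache-hive-2.3.9-bin/lib/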
- Initialize MySQL
./bin/schematool -dbType mysql -initSchema
- Start Hive
./bin/hive
- Hive testing
- Create a table
-- Create a database and switch to it
create database hive;
use hive;
-- Create a table
create table hive_test (a int, b string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
- Create a new shell script named gen_data.sh
#!/bin/bash
MAXROW=1000000 # Specify the number of data rows to generate
for ((i = 0; i < $MAXROW; i++))
do
    echo $RANDOM, \"$RANDOM\"
done
- Run the script to generate test data
./gen_data.sh > hive_test.data
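- Upload the generated data to BOS. The load command below reads from a bos:// path, so the file must be on BOS first; a minimal sketch, assuming the BOS HDFS setup above and ${bucket_name} as a placeholder for your bucket:
hadoop fs -put hive_test.data bos://${bucket_name}/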
- Load the data into the table
load data inpath "bos://${bucket_name}/hive_test.data" into table hive.hive_test;
- Query
hive> select count(*) from hive_test;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230528173013_6f5296db-562e-4342-917f-bcf14fc1480d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-05-28 17:30:16,548 Stage-1 map = 0%, reduce = 0%
2023-05-28 17:30:18,558 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local238749048_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
hive> select * from hive_test limit 10;
OK
11027   "11345"
10227   "24281"
32535   "16409"
24286   "24435"
2498    "10969"
16662   "16163"
5345    "26005"
21407   "5365"
30608   "4588"
19686   "11831"
Presto
The installed version is 349; for the installation steps, refer to the article Presto Installation. If the local machine also serves as a worker, the following must be set in config.properties:
# must be set to true so that the coordinator node also schedules work on itself
node-scheduler.include-coordinator=true
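For reference, a complete single-node config.properties might look like the following; these values are illustrative rather than prescriptive, and the port matches the CLI invocation later in this article:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8881
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8881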
In etc/catalog/hive.properties, configure as follows:
connector.name=hive-hadoop2
# the paths here must be correct
hive.config.resources=/ssd2/hadoop-3.3.2/etc/hadoop/core-site.xml,/ssd2/hadoop-3.3.2/etc/hadoop/hdfs-site.xml
hive.metastore.uri=thrift://127.0.0.1:9083
hive.allow-drop-table=false
hive.storage-format=ORC
hive.metastore-cache-ttl=1s
hive.metastore-refresh-interval=1s
hive.metastore-timeout=35m
hive.max-partitions-per-writers=1000
hive.cache.enabled=true
hive.cache.location=/opt/hive-cache
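Note that hive.metastore.uri above assumes a Hive metastore thrift service listening on port 9083. If it is not already running, it can be started from the Hive installation directory (a hedged step; adjust the path to your setup):
./bin/hive --service metastore &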
Copy the BOS filesystem jar package into plugin/hive-hadoop2/, then start the presto-server:
./bin/launcher start
Once the server is running, connect with the Presto CLI:
./presto-cli --server localhost:8881 --catalog hive --schema default
Run a query:
presto:default> use hive;
USE
presto:hive> select * from hive_test limit 10;
   a   |    b
-------+----------
 11027 | "11345"
 10227 | "24281"
 32535 | "16409"
 24286 | "24435"
  2498 | "10969"
 16662 | "16163"
  5345 | "26005"
 21407 | "5365"
 30608 | "4588"
 19686 | "11831"
(10 rows)
Query 20230601_084130_00004_dzvjb, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
[Latency: client-side: 0:02, server-side: 0:02] [59.4K rows, 831KB] [28.9K rows/s, 404KB/s]
In the example above, we use Presto to query data stored in BOS.
Presto access based on S3
Presto can access data stored in BOS only through Hive, and Hive can interact with BOS in two ways: the first is the BOS HDFS method described above; the second is connecting directly to BOS using the S3 protocol.
Hive configuration
Refer to the steps in Installing Metastore:
wget "https://repo1.maven.org/maven2/org/apache/hive/hive-standalone-metastore/3.1.2/hive-standalone-metastore-3.1.2-bin.tar.gz"
tar -zxvf hive-standalone-metastore-3.1.2-bin.tar.gz
sudo mv apache-hive-metastore-3.1.2-bin /usr/local/metastore
sudo chown user:user /usr/local/metastore
After downloading hive-standalone-metastore, replace the bundled Guava jar and add the necessary S3A jar packages:
rm /usr/local/metastore/lib/guava-19.0.jar
cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar \
   /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-3.2.1.jar \
   /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar \
   /usr/local/metastore/lib/
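The hadoop-aws and aws-java-sdk-bundle versions must match the ones shipped with your Hadoop distribution; a quick way to check what is available (a hedged sketch, assuming Hadoop under /usr/local/hadoop):
ls /usr/local/hadoop/share/hadoop/tools/lib/ | grep -E 'hadoop-aws|aws-java-sdk'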
Then configure /usr/local/metastore/conf/metastore-site.xml.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>
<property>
  <name>hive.metastore.event.db.notification.api.auth</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>S3_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>S3_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>S3_ENDPOINT</value>
</property>
fs.s3a.endpoint refers to the S3 endpoint; the S3 endpoint of BOS can be found in BOS S3 Domain Name.
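Before the first start, the MySQL JDBC driver must be present in the metastore's lib directory and the metastore schema needs to be initialized; a minimal sketch, reusing the driver jar from earlier (paths are illustrative):
cp mysql-connector-java-5.1.32-bin.jar /usr/local/metastore/lib/
/usr/local/metastore/bin/schematool -initSchema -dbType mysql
Then start the Hive metastore: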
/usr/local/metastore/bin/start-metastore &
Presto configuration
The version of Presto remains unchanged here, and the content of hive.properties is changed to:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.path-style-access=true
hive.s3.endpoint=S3_ENDPOINT
hive.s3.aws-access-key=S3_ACCESS_KEY
hive.s3.aws-secret-key=S3_SECRET_KEY
hive.s3.ssl.enabled=false
Start Presto
/usr/local/trino/bin/launcher start
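As in the BOS HDFS setup, connect with the CLI once the server is up (assuming the same server address as before); the SQL below is run from this session:
./presto-cli --server localhost:8881 --catalog hive --schema default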
Create a schema:
CREATE SCHEMA IF NOT EXISTS hive.iris
WITH (location = 's3a://my-bos-bucket/');
-- Note: the location must start with s3a://. If the bucket is my-bos-bucket,
-- the location should be 's3a://my-bos-bucket/'.
Create a table and query it:
CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
    sepal_length DOUBLE,
    sepal_width DOUBLE,
    petal_length DOUBLE,
    petal_width DOUBLE,
    class VARCHAR
)
WITH (
    external_location = 's3a://my-bos-bucket/iris_parquet',
    format = 'PARQUET'
);
SELECT
    sepal_length,
    class
FROM hive.iris.iris_parquet
LIMIT 10;
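Note that the CREATE TABLE above only registers metadata; it assumes Parquet files already exist under s3a://my-bos-bucket/iris_parquet. A hedged sketch for uploading a local Parquet file with the Hadoop S3A client (the file name and credential placeholders are illustrative):
hadoop fs -D fs.s3a.endpoint=${S3_ENDPOINT} \
          -D fs.s3a.access.key=${S3_ACCESS_KEY} \
          -D fs.s3a.secret.key=${S3_SECRET_KEY} \
          -D fs.s3a.path.style.access=true \
          -put iris.parquet s3a://my-bos-bucket/iris_parquet/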
