Presto Practice on BOS
Overview
Presto is a data query engine developed by Facebook. It supports fast, interactive analysis of massive datasets and can query a variety of data sources, including Hive and relational databases. Thanks to BOS's strengths as a storage layer, namely ultra-low price, high performance, high reliability, and high throughput, more and more enterprises choose BOS as their big data storage medium. This article therefore gives a brief introduction to using Presto on BOS.
Using Presto over BOS HDFS
BOS HDFS
First, install and configure BOS HDFS by following the BOS HDFS documentation. The Hadoop version installed on this machine is hadoop-3.3.2. Complete the basic walkthrough in the "Getting Started" section of that document, then set the environment variables:
export HADOOP_HOME=/opt/hadoop-3.3.2
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
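Before moving on, it is worth a quick sanity check that Hadoop can actually reach the bucket through BOS HDFS. A minimal sketch, assuming a bucket placeholder ${bucket_name} and that the BOS HDFS credentials are already configured:
# list the bucket root over the bos:// scheme; a listing (even an empty one) means the setup works
hadoop fs -ls bos://${bucket_name}/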
Hive
Hive is a data warehouse tool built on Hadoop for data extraction, transformation, and loading (ETL). It provides a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop: Hive maps structured data files onto database tables, provides SQL query capability, and translates SQL statements into MapReduce jobs for execution.
Installing MySQL
MySQL is used to store Hive's metadata. It can be installed locally, or you can connect remotely to an existing MySQL instance or RDS. The version installed on this machine is mysql-5.1.61-4.el6.x86_64.
After installation, you can check its running status with service mysqld status, then set a new password:
/usr/bin/mysqladmin -u root -h ${IP} password ${new-password}  # set a new password
You can also create a dedicated MySQL user for Hive and set its password.
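A sketch of creating such a user, assuming the user name hive and a placeholder password (adjust the host and privileges to your environment):
# create a hive user and grant it access to the metadata database used below
mysql -u root -p -e "CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive_password';"
mysql -u root -p -e "GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost'; FLUSH PRIVILEGES;"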
Installing Hive
The version installed on this machine is 2.3.9.
- Rename the two configuration templates in the conf folder:
mv hive-env.sh.template hive-env.sh
mv hive-site.xml.template hive-site.xml
Add to hive-env.sh:
export HIVE_CONF_DIR=/ssd2/apache-hive-2.3.9-bin/conf
Add to hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>MySQL</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>JDBC</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>new-password</value>
  <description>password</description>
</property>
In this configuration, javax.jdo.option.ConnectionURL is the connection address of the MySQL server, javax.jdo.option.ConnectionUserName is the MySQL user name used for Hive, and javax.jdo.option.ConnectionPassword is that user's password. After these are set, copy the MySQL JDBC driver into the lib folder; the driver used on this machine is mysql-connector-java-5.1.32-bin.jar.
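A sketch of that copy step, assuming the driver jar sits in the current directory and Hive is installed under /ssd2/apache-hive-2.3.9-bin (matching the HIVE_CONF_DIR set above):
# put the MySQL JDBC driver on Hive's classpath
cp mysql-connector-java-5.1.32-bin.jar /ssd2/apache-hive-2.3.9-bin/lib/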
- Initialize MySQL:
./bin/schematool -dbType mysql -initSchema
- Start Hive:
./bin/hive
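Presto will later connect to the Hive metastore over Thrift (the hive.metastore.uri setting below points at port 9083). If the metastore service is not already running, one way to start it in the background is:
# start the Hive metastore service; it listens on port 9083 by default
./bin/hive --service metastore &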
- Test Hive
- Create a table:
create database hive;  -- create the database
use hive;
create table hive_test (a int, b string)  -- create the table
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
- Create a shell script named gen_data.sh:
#!/bin/bash
MAXROW=1000000  # number of data rows to generate
for((i = 0; i < $MAXROW; i++))
do
    echo $RANDOM, \"$RANDOM\"
done
- Run the script to generate the test data:
./gen_data.sh > hive_test.data
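The load statement below reads the file from BOS, so upload the generated data first. A sketch, reusing the ${bucket_name} placeholder and the BOS HDFS setup from earlier:
# upload the generated data to the bucket so Hive can load it from bos://
hadoop fs -put hive_test.data bos://${bucket_name}/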
- Load the data into the table:
load data inpath "bos://${bucket_name}/hive_test.data" into table hive.hive_test;
- Query:
hive> select count(*) from hive_test;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230528173013_6f5296db-562e-4342-917f-bcf14fc1480d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-05-28 17:30:16,548 Stage-1 map = 0%, reduce = 0%
2023-05-28 17:30:18,558 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local238749048_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000

hive> select * from hive_test limit 10;
OK
11027 "11345"
10227 "24281"
32535 "16409"
24286 "24435"
2498 "10969"
16662 "16163"
5345 "26005"
21407 "5365"
30608 "4588"
19686 "11831"
Presto
The installed version is 349; you can follow the Presto installation guide for the procedure. Note that if this machine also acts as a worker, config.properties needs:
# must be set to true when the coordinator machine also serves as a worker
node-scheduler.include-coordinator=true
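For reference, a minimal single-node config.properties consistent with the CLI commands used below might look as follows (the port 8881 and discovery URI are assumptions chosen to match the --server flag used later):
cat > etc/config.properties <<'EOF'
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8881
discovery-server.enabled=true
discovery.uri=http://localhost:8881
EOF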
In etc/catalog/hive.properties, configure:
connector.name=hive-hadoop2
# these paths must point at the actual Hadoop configuration files
hive.config.resources=/ssd2/hadoop-3.3.2/etc/hadoop/core-site.xml,/ssd2/hadoop-3.3.2/etc/hadoop/hdfs-site.xml
hive.metastore.uri=thrift://127.0.0.1:9083
hive.allow-drop-table=false
hive.storage-format=ORC
hive.metastore-cache-ttl=1s
hive.metastore-refresh-interval=1s
hive.metastore-timeout=35m
hive.max-partitions-per-writers=1000
hive.cache.enabled=true
hive.cache.location=/opt/hive-cache
Copy the BOS filesystem jar package into plugin/hive-hadoop2/. A sketch of this step, assuming the SDK jar name and a Presto installation under /opt/presto-server-349:
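# make the bos:// filesystem implementation visible to the hive-hadoop2 connector
cp bos-hdfs-sdk-*.jar /opt/presto-server-349/plugin/hive-hadoop2/

Then run: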
./bin/launcher start
to start the presto-server, and connect with the Presto CLI:
./presto-cli --server localhost:8881 --catalog hive --schema default
Run:
presto:default> use hive;
USE
presto:hive> select * from hive_test limit 10;
   a   |    b
-------+----------
 11027 | "11345"
 10227 | "24281"
 32535 | "16409"
 24286 | "24435"
  2498 | "10969"
 16662 | "16163"
  5345 | "26005"
 21407 | "5365"
 30608 | "4588"
 19686 | "11831"
(10 rows)

Query 20230601_084130_00004_dzvjb, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
[Latency: client-side: 0:02, server-side: 0:02] [59.4K rows, 831KB] [28.9K rows/s, 404KB/s]
In the example above, we used Presto to query data stored in BOS.
Presto Access via S3
Presto can only access data stored in BOS through Hive, but Hive itself can reach BOS data in two ways: the first is via BOS HDFS, as introduced above; the second is directly via the S3 protocol.
Hive Configuration
Follow the steps in the metastore installation guide:
wget "https://repo1.maven.org/maven2/org/apache/hive/hive-standalone-metastore/3.1.2/hive-standalone-metastore-3.1.2-bin.tar.gz"
tar -zxvf hive-standalone-metastore-3.1.2-bin.tar.gz
sudo mv apache-hive-metastore-3.1.2-bin /usr/local/metastore
sudo chown user:user /usr/local/metastore
After downloading hive-standalone-metastore, add the required jar packages:
rm /usr/local/metastore/lib/guava-19.0.jar
cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar \
   /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-3.2.1.jar \
   /usr/local/metastore/lib/
cp /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar \
   /usr/local/metastore/lib/
Then configure /usr/local/metastore/conf/metastore-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>
<property>
  <name>hive.metastore.event.db.notification.api.auth</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>S3_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>S3_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>S3_ENDPOINT</value>
</property>
fs.s3a.endpoint is the S3 endpoint; the S3 endpoint for BOS can be found in the BOS S3 domain name documentation. Start the Hive metastore:
/usr/local/metastore/bin/start-metastore &
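A quick way to confirm the metastore came up is to check that its Thrift port (9083 by default) is listening; the exact tool available will vary by system:
# verify the metastore Thrift service is listening
ss -lnt | grep 9083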
Presto Configuration
The Presto version is unchanged here; the contents of hive.properties become:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.path-style-access=true
hive.s3.endpoint=S3_ENDPOINT
hive.s3.aws-access-key=S3_ACCESS_KEY
hive.s3.aws-secret-key=S3_SECRET_KEY
hive.s3.ssl.enabled=false
Start Presto:
/usr/local/trino/bin/launcher start
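Then connect with the CLI as before; a sketch assuming the server still listens on port 8881 as configured earlier:
# open an interactive session against the hive catalog
./presto-cli --server localhost:8881 --catalog hive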
Create the schema and table, then run a query:
CREATE SCHEMA IF NOT EXISTS hive.iris
WITH (location = 's3a://my-bos-bucket/');  -- note: the location uses the s3a scheme; for a bucket named my-bos-bucket it must start with s3a://my-bos-bucket/

-- create the table
CREATE TABLE IF NOT EXISTS hive.iris.iris_parquet (
    sepal_length DOUBLE,
    sepal_width DOUBLE,
    petal_length DOUBLE,
    petal_width DOUBLE,
    class VARCHAR
)
WITH (
    external_location = 's3a://my-bos-bucket/iris_parquet',
    format = 'PARQUET'
);

SELECT
    sepal_length,
    class
FROM hive.iris.iris_parquet
LIMIT 10;
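The external table above starts out empty unless Parquet files already exist under the external_location prefix. As an end-to-end check you can insert a sample row from the CLI; a sketch, assuming the port above and noting that writes to external tables additionally require hive.non-managed-table-writes-enabled=true in hive.properties:
# insert one sample row and read it back through the S3 connector
./presto-cli --server localhost:8881 --catalog hive --execute "INSERT INTO iris.iris_parquet VALUES (5.1, 3.5, 1.4, 0.2, 'Iris-setosa');"
./presto-cli --server localhost:8881 --catalog hive --execute "SELECT * FROM iris.iris_parquet LIMIT 10;"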
