Doris整合Iceberg+Spark联邦查询实战
环境介绍:
| Soft | Verison |
|---|---|
| Hadoop | 3.3 |
| Hive | 3.1.3 |
| Doris | 1.1-pr2 |
| Spark | 3.1.2 |
| Iceberg | 0.13.2 |
Hadoop HDFS 环境安装
-
Hadoop3.3 下载
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz -
Lacalhost SSH 免密登录
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys chmod 0600 ~/.ssh/authorized_keys ssh localhost执行 ssh localhost,会弹出安全提示,填写yes即可。
如果是docker环境下,需要手动启动sshd服务。
/usr/sbin/sshd -
core-site.xml配置
[root@17a5da45700b hadoop]# cat core-site.xml <?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. See accompanying LICENSE file. --> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:9000</value> </property> <property> <name>hadoop.proxyuser.root.hosts</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.root.groups</name> <value>*</value> </property> </configuration> -
hdfs-site.xml配置
[root@17a5da45700b hadoop]# cat hdfs-site.xml <?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. See accompanying LICENSE file. --> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration> -
hadoop-env.sh配置:添加如下配置到hadoop-env.sh。
export JAVA_HOME=/opt/software/jdk8 export HDFS_NAMENODE_USER=root export HDFS_DATANODE_USER=root export HDFS_SECONDARYNAMENODE_USER=root export YARN_RESOURCEMANAGER_USER=root export YARN_NODEMANAGER_USER=root -
格式化hdfs文件系统
bin/hdfs namenode -format -
启动hdfs
sbin/start-dfs.sh -
创建HDFS目录
bin/hdfs dfs -mkdir /user bin/hdfs dfs -mkdir /user/root -
查看HDFS目录,确保HDFS服务正常。
[root@17a5da45700b hadoop-3.3.3]# bin/hdfs dfs -ls /user Found 2 items drwxr-xr-x - root supergroup 0 2022-06-21 03:00 /user/hive drwxr-xr-x - root supergroup 0 2022-06-15 09:38 /user/root -
Hadoop环境变量配置。
export HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.3 export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop export HADOOP_HDFS_HOME=$HADOOP_HOME export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Hive环境安装
-
下载并安装Hive
wgt https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz tar -xzvf apache-hive-3.1.3-bin.tar.gz cd apache-hive-3.1.3-bin -
Hive环境配置
export HIVE_HOME=/opt/software/hive/apache-hive-3.1.3-bin export PATH=$HIVE_HOME/bin:$PATH -
在Hdfs上创建Hive目录并赋权。
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse -
hive-env.sh添加HADOOP_HOME配置
#创建hive-env.sh文件 mv hive-env.sh.template hive-env.sh #添加HADOOP_HOME配置 HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.3 -
hive-site.xml配置:这里新建hive-site.xml配置文件
vim hive-site.xml [root@17a5da45700b conf]# cat hive-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://localhost:3306/hive? createDatabaseIfNotExist=true</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>root</value> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>root</value> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/user/hive/warehouse</value> <description>location of default database for the warehouse</description> </property> <property> <name>hive.metastore.uris</name> <value>thrift://127.0.0.1:9083</value> <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description> </property> <property> <name>javax.jdo.PersistenceManagerFactoryClass</name> <value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value> </property> <property> <name>hive.metastore.schema.verification</name> <value>false</value> </property> <property> <name>datanucleus.schema.autoCreateAll</name> <value>true</value> </property> </configuration> -
在MySQL侧创建hive数据库,用户存储Hive元数据信息。
mysql --host=localhost --user=root --password=root
create database hive;
use hive;
- 添加MySQL连接到apache-hive-3.1.3-bin/lib
cd /opt/software/hive/apache-hive-3.1.3-bin/lib
wegt https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar
📢:注意mysql-connector-java-5.1.49.jar的版本要和MySQL的大版本匹配,如mysql-connector版本大于MySQL很可能会报错。
我这里MySQL版本为5.7.38:
mysql> show variables like '%version%'
-> ;
+-------------------------+------------------------------+
| Variable_name | Value |
+-------------------------+------------------------------+
| innodb_version | 5.7.38 |
| protocol_version | 10 |
| slave_type_conversions | |
| tls_version | TLSv1,TLSv1.1,TLSv1.2 |
| version | 5.7.38 |
| version_comment | MySQL Community Server (GPL) |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
+-------------------------+------------------------------+
8 rows in set (0.00 sec)
mysql>
- Hive元数据初始化。
schematool -initSchema -dbType mysql
- 在MySQL中查看Hive元数据
mysql> show tables;
+-------------------------------+
| Tables_in_hive |
+-------------------------------+
| AUX_TABLE |
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| COMPACTION_QUEUE |
| COMPLETED_COMPACTIONS |
| COMPLETED_TXN_COMPONENTS |
| CTLGS |
| DATABASE_PARAMS |
| DBS |
| DB_PRIVS |
| DELEGATION_TOKENS |
| FUNCS |
| FUNC_RU |
| GLOBAL_PRIVS |
| HIVE_LOCKS |
| IDXS |
| INDEX_PARAMS |
| I_SCHEMA |
| KEY_CONSTRAINTS |
| MASTER_KEYS |
| MATERIALIZATION_REBUILD_LOCKS |
| METASTORE_DB_PROPERTIES |
| MIN_HISTORY_LEVEL |
| MV_CREATION_METADATA |
| MV_TABLES_USED |
| NEXT_COMPACTION_QUEUE_ID |
| NEXT_LOCK_ID |
| NEXT_TXN_ID |
| NEXT_WRITE_ID |
| NOTIFICATION_LOG |
| NOTIFICATION_SEQUENCE |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_EVENTS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_PRIVS |
| PART_COL_STATS |
| PART_PRIVS |
| REPL_TXN_MAP |
| ROLES |
| ROLE_MAP |
| RUNTIME_STATS |
| SCHEMA_VERSION |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
| TXNS |
| TXN_COMPONENTS |
| TXN_TO_WRITE_ID |
| TYPES |
| TYPE_FIELDS |
| VERSION |
| WM_MAPPING |
| WM_POOL |
| WM_POOL_TO_TRIGGER |
| WM_RESOURCEPLAN |
| WM_TRIGGER |
| WRITE_SET |
+-------------------------------+
74 rows in set (0.00 sec)
-
启动HiveServer2和Metastore
hive --service hiveserver2 >/dev/null 2>/dev/null & hive --service metastore 1>/dev/null 2>&1 & -
查看进程
[root@17a5da45700b lib ~]# jobs [1]- Running hive --service hiveserver2 > /dev/null 2> /dev/null & [2]+ Running hive --service metastore > /dev/null 2>&1 & -
查看metastore端口:
[root@17a5da45700b lib]# netstat -ano |grep 9083 tcp 0 0 0.0.0.0:9083 0.0.0.0:* LISTEN off (0.00/0/0)
Doris环境安装
-
Doris安装参照官网:https://doris.apache.org/zh-CN/docs/get-starting/get-starting.html#%E5%8D%95%E6%9C%BA%E9%83%A8%E7%BD%B2
-
Doris和Hadoop有一些端口冲突,需要对Doirs默认端口进行修改。
vim be/conf/be.conf将webserver_port = 8040修改为webserver_port = 18040
vim fe/conf/fe.conf将http_port = 8030修改为http_port = 18030
Spark环境安装
-
Spark下载:
wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz tar -xzvf spark-3.1.2-bin-hadoop3.2.tgz -
配置Spark环境变量
export SPARK_HOME=/opt/software/spark-3.1.2-bin-hadoop3.2 export PATH=$PATH:$SPARK_HOME/bin -
将hive-site.xml复制到Spark conf目录
cp /opt/software/hive/apache-hive-3.1.3-bin/conf/hive-site.xml /opt/software/spark-3.1.2-bin-hadoop3.2/conf -
添加MySQL驱动、hive依赖、Iceberg依赖到Spark
cd /opt/software/spark-3.1.2-bin-hadoop3.2/jars/ wegt https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar wget https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/0.13.2/iceberg-spark-runtime-3.1_2.12-0.13.2.jar wget https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.13.2/iceberg-hive-runtime-0.13.2.jar注意:这里下载完可能出现如下的文件名,需要我们使用mv命令将其修改为正确的Jar包名称:
remotecontent?filepath=org%2Fapache%2Ficeberg%2Ficeberg-spark-runtime-3.1_2.12%2F0.13.2%2Ficeberg-spark-runtime-3.1_2.12-0.13.2.jar
Iceberg On Spark建库建表
- 连接Spark-sql
spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.hive_prod.type=hive \
--conf spark.sql.catalog.hive_prod.uri=thrift://127.0.0.1:9083
- 启用Iceberg hive支持
SET iceberg.engine.hive.enabled=true;
- 创建数据库和表:
ADD JAR /opt/software/hive/apache-hive-3.1.3-bin/lib/iceberg-hive-runtime-0.13.2.jar;
ADD JAR /opt/software/hive/apache-hive-3.1.3-bin/lib/iceberg-spark-runtime-3.1_2.12-0.13.2.jar;
create database hive_prod.testdb;
use hive_prod.testdb;
CREATE TABLE hive_prod.testdb.test_table (id bigint, data string) USING iceberg;
- 插入数据:
INSERT INTO hive_prod.testdb.test_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
Doris创建Iceberg外表和读取Iceberg数据
- 登录Doris
mysql -h 127.0.0.1 -P 9030 -uroot
- 创建数据库和表:
CREATE DATABASE test_iceberg_db PROPERTIES ( "iceberg.database" = "testdb", "iceberg.hive.metastore.uris" = "thrift://127.0.0.1:9083", "iceberg.catalog.type" = "HIVE_CATALOG" )
-
Doris数据查询。
use test_iceberg_db mysql> show tables; +----------------------------+ | Tables_in_test_iceberg_db | +----------------------------+ | test_table | +----------------------------+ 1 row in set (0.00 sec) mysql> select * from test_table; +------+------+ | id | data | +------+------+ | 1 | a | | 2 | b | | 3 | c | +------+------+ 3 rows in set (0.40 sec)
本文档详述了如何在Hadoop 3.3、Hive 3.1.3、Doris 1.1-pr2、Spark 3.1.2和Iceberg 0.13.2环境下配置和使用Doris进行Iceberg表的联邦查询。步骤包括Hadoop、Hive、Doris和Spark的安装、配置,以及Hive元数据的初始化,最后展示了Doris如何读取Iceberg数据。

840

被折叠的 条评论
为什么被折叠?



