Doris整合Iceberg+Spark联邦查询实战

本文档详述了如何在Hadoop 3.3、Hive 3.1.3、Doris 1.1-pr2、Spark 3.1.2和Iceberg 0.13.2环境下配置和使用Doris进行Iceberg表的联邦查询。步骤包括Hadoop、Hive、Doris和Spark的安装、配置,以及Hive元数据的初始化,最后展示了Doris如何读取Iceberg数据。

Doris整合Iceberg+Spark联邦查询实战

环境介绍:
SoftVerison
Hadoop3.3
Hive3.1.3
Doris1.1-pr2
Spark3.1.2
Iceberg0.13.2
Hadoop HDFS 环境安装
  1. Hadoop3.3 下载

    wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
    
  2. Lacalhost SSH 免密登录

      ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      chmod 0600 ~/.ssh/authorized_keys
      ssh localhost
    

    执行 ssh localhost,会弹出安全提示,填写yes即可。

    如果是docker环境下,需要手动启动sshd服务。

     /usr/sbin/sshd
    
  3. core-site.xml配置

    [root@17a5da45700b hadoop]# cat core-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
    
        http://www.apache.org/licenses/LICENSE-2.0
    
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    
    <property>
      <name>hadoop.proxyuser.root.hosts</name>
      <value>*</value>
    </property>
    
    <property>
      <name>hadoop.proxyuser.root.groups</name>
      <value>*</value>
    </property>
    
    </configuration>
    
  4. hdfs-site.xml配置

    [root@17a5da45700b hadoop]# cat  hdfs-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
    
        http://www.apache.org/licenses/LICENSE-2.0
    
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
    
  5. hadoop-env.sh配置:添加如下配置到hadoop-env.sh。

    export JAVA_HOME=/opt/software/jdk8
    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root
    export YARN_RESOURCEMANAGER_USER=root
    export YARN_NODEMANAGER_USER=root
    
  6. 格式化hdfs文件系统

    bin/hdfs namenode -format
    
  7. 启动hdfs

    sbin/start-dfs.sh
    
  8. 创建HDFS目录

     bin/hdfs dfs -mkdir /user
     bin/hdfs dfs -mkdir /user/root	
    
  9. 查看HDFS目录,确保HDFS服务正常。

    [root@17a5da45700b hadoop-3.3.3]#  bin/hdfs dfs -ls /user
    Found 2 items
    drwxr-xr-x   - root supergroup          0 2022-06-21 03:00 /user/hive
    drwxr-xr-x   - root supergroup          0 2022-06-15 09:38 /user/root
    
  10. Hadoop环境变量配置。

    export HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.3
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
    
Hive环境安装
  1. 下载并安装Hive

    wgt https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
    tar -xzvf apache-hive-3.1.3-bin.tar.gz
    cd apache-hive-3.1.3-bin
    
    
  2. Hive环境配置

    export HIVE_HOME=/opt/software/hive/apache-hive-3.1.3-bin
    export PATH=$HIVE_HOME/bin:$PATH
    
  3. 在Hdfs上创建Hive目录并赋权。

    $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
    $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
    $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
    $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
    
  4. hive-env.sh添加HADOOP_HOME配置

    #创建hive-env.sh文件
    mv hive-env.sh.template hive-env.sh
    #添加HADOOP_HOME配置
    HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.3
    
  5. hive-site.xml配置:这里新建hive-site.xml配置文件

    vim hive-site.xml
    [root@17a5da45700b conf]# cat hive-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
            <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://localhost:3306/hive?
    createDatabaseIfNotExist=true</value>
            </property>
            <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.jdbc.Driver</value>
            </property>
            <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>root</value>
    </property>
    <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>root</value>
            </property>
            <property>
                    <name>hive.metastore.warehouse.dir</name>
                    <value>/user/hive/warehouse</value>
                    <description>location of default database for the
    warehouse</description>
            </property>
            <property>
                    <name>hive.metastore.uris</name>
                     <value>thrift://127.0.0.1:9083</value>  
                    <description>Thrift URI for the remote metastore. Used by metastore
    client to connect to remote metastore.</description>
            </property>
            <property>
                    <name>javax.jdo.PersistenceManagerFactoryClass</name>
                    <value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
            </property>
            <property>
                    <name>hive.metastore.schema.verification</name>
                    <value>false</value>
            </property>
            <property>
                    <name>datanucleus.schema.autoCreateAll</name>
                    <value>true</value>
            </property>
    </configuration>
    
  6. 在MySQL侧创建hive数据库,用户存储Hive元数据信息。

mysql --host=localhost --user=root --password=root
create database hive;
use hive;
  1. 添加MySQL连接到apache-hive-3.1.3-bin/lib
cd /opt/software/hive/apache-hive-3.1.3-bin/lib
wegt https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar

📢:注意mysql-connector-java-5.1.49.jar的版本要和MySQL的大版本匹配,如mysql-connector版本大于MySQL很可能会报错。

我这里MySQL版本为5.7.38:

mysql> show variables like '%version%'
    -> ;
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| innodb_version          | 5.7.38                       |
| protocol_version        | 10                           |
| slave_type_conversions  |                              |
| tls_version             | TLSv1,TLSv1.1,TLSv1.2        |
| version                 | 5.7.38                       |
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86_64                       |
| version_compile_os      | Linux                        |
+-------------------------+------------------------------+
8 rows in set (0.00 sec)

mysql> 
  1. Hive元数据初始化。
schematool -initSchema -dbType mysql
  1. 在MySQL中查看Hive元数据
mysql> show tables;
+-------------------------------+
| Tables_in_hive                |
+-------------------------------+
| AUX_TABLE                     |
| BUCKETING_COLS                |
| CDS                           |
| COLUMNS_V2                    |
| COMPACTION_QUEUE              |
| COMPLETED_COMPACTIONS         |
| COMPLETED_TXN_COMPONENTS      |
| CTLGS                         |
| DATABASE_PARAMS               |
| DBS                           |
| DB_PRIVS                      |
| DELEGATION_TOKENS             |
| FUNCS                         |
| FUNC_RU                       |
| GLOBAL_PRIVS                  |
| HIVE_LOCKS                    |
| IDXS                          |
| INDEX_PARAMS                  |
| I_SCHEMA                      |
| KEY_CONSTRAINTS               |
| MASTER_KEYS                   |
| MATERIALIZATION_REBUILD_LOCKS |
| METASTORE_DB_PROPERTIES       |
| MIN_HISTORY_LEVEL             |
| MV_CREATION_METADATA          |
| MV_TABLES_USED                |
| NEXT_COMPACTION_QUEUE_ID      |
| NEXT_LOCK_ID                  |
| NEXT_TXN_ID                   |
| NEXT_WRITE_ID                 |
| NOTIFICATION_LOG              |
| NOTIFICATION_SEQUENCE         |
| NUCLEUS_TABLES                |
| PARTITIONS                    |
| PARTITION_EVENTS              |
| PARTITION_KEYS                |
| PARTITION_KEY_VALS            |
| PARTITION_PARAMS              |
| PART_COL_PRIVS                |
| PART_COL_STATS                |
| PART_PRIVS                    |
| REPL_TXN_MAP                  |
| ROLES                         |
| ROLE_MAP                      |
| RUNTIME_STATS                 |
| SCHEMA_VERSION                |
| SDS                           |
| SD_PARAMS                     |
| SEQUENCE_TABLE                |
| SERDES                        |
| SERDE_PARAMS                  |
| SKEWED_COL_NAMES              |
| SKEWED_COL_VALUE_LOC_MAP      |
| SKEWED_STRING_LIST            |
| SKEWED_STRING_LIST_VALUES     |
| SKEWED_VALUES                 |
| SORT_COLS                     |
| TABLE_PARAMS                  |
| TAB_COL_STATS                 |
| TBLS                          |
| TBL_COL_PRIVS                 |
| TBL_PRIVS                     |
| TXNS                          |
| TXN_COMPONENTS                |
| TXN_TO_WRITE_ID               |
| TYPES                         |
| TYPE_FIELDS                   |
| VERSION                       |
| WM_MAPPING                    |
| WM_POOL                       |
| WM_POOL_TO_TRIGGER            |
| WM_RESOURCEPLAN               |
| WM_TRIGGER                    |
| WRITE_SET                     |
+-------------------------------+
74 rows in set (0.00 sec)


  1. 启动HiveServer2和Metastore

    hive --service hiveserver2  >/dev/null 2>/dev/null &
    hive --service metastore  1>/dev/null  2>&1  &
    
  2. 查看进程

    [root@17a5da45700b lib ~]# jobs
    [1]-  Running                 hive --service hiveserver2 > /dev/null 2> /dev/null &
    [2]+  Running                 hive --service metastore > /dev/null 2>&1 &
    
    
  3. 查看metastore端口:

    [root@17a5da45700b lib]# netstat -ano |grep 9083
    tcp        0      0 0.0.0.0:9083            0.0.0.0:*               LISTEN      off (0.00/0/0)
    
Doris环境安装
  1. Doris安装参照官网:https://doris.apache.org/zh-CN/docs/get-starting/get-starting.html#%E5%8D%95%E6%9C%BA%E9%83%A8%E7%BD%B2

  2. Doris和Hadoop有一些端口冲突,需要对Doirs默认端口进行修改。

    vim be/conf/be.conf
    

    将webserver_port = 8040修改为webserver_port = 18040

    vim fe/conf/fe.conf 
    

    将http_port = 8030修改为http_port = 18030

Spark环境安装
  1. Spark下载:

    wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
    tar -xzvf spark-3.1.2-bin-hadoop3.2.tgz
    
  2. 配置Spark环境变量

    export SPARK_HOME=/opt/software/spark-3.1.2-bin-hadoop3.2
    export PATH=$PATH:$SPARK_HOME/bin
    
  3. 将hive-site.xml复制到Spark conf目录

    cp /opt/software/hive/apache-hive-3.1.3-bin/conf/hive-site.xml /opt/software/spark-3.1.2-bin-hadoop3.2/conf
    
  4. 添加MySQL驱动、hive依赖、Iceberg依赖到Spark

    cd /opt/software/spark-3.1.2-bin-hadoop3.2/jars/
    wegt https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar
    wget https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/0.13.2/iceberg-spark-runtime-3.1_2.12-0.13.2.jar
    wget https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.13.2/iceberg-hive-runtime-0.13.2.jar
    

    注意:这里下载完可能出现如下的文件名,需要我们使用mv命令将其修改为正确的Jar包名称:

    remotecontent?filepath=org%2Fapache%2Ficeberg%2Ficeberg-spark-runtime-3.1_2.12%2F0.13.2%2Ficeberg-spark-runtime-3.1_2.12-0.13.2.jar
    
Iceberg On Spark建库建表
  1. 连接Spark-sql

spark-sql  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_prod.type=hive \
    --conf spark.sql.catalog.hive_prod.uri=thrift://127.0.0.1:9083
  1. 启用Iceberg hive支持
SET iceberg.engine.hive.enabled=true;
  1. 创建数据库和表:
ADD JAR /opt/software/hive/apache-hive-3.1.3-bin/lib/iceberg-hive-runtime-0.13.2.jar;
ADD JAR /opt/software/hive/apache-hive-3.1.3-bin/lib/iceberg-spark-runtime-3.1_2.12-0.13.2.jar;
create database hive_prod.testdb;
use hive_prod.testdb;
CREATE TABLE hive_prod.testdb.test_table (id bigint, data string) USING iceberg;
  1. 插入数据:
INSERT INTO hive_prod.testdb.test_table  VALUES (1, 'a'), (2, 'b'), (3, 'c');
Doris创建Iceberg外表和读取Iceberg数据
  1. 登录Doris
mysql -h 127.0.0.1 -P 9030 -uroot
  1. 创建数据库和表:
CREATE DATABASE test_iceberg_db PROPERTIES ( "iceberg.database" = "testdb", "iceberg.hive.metastore.uris" = "thrift://127.0.0.1:9083", "iceberg.catalog.type" = "HIVE_CATALOG" )
  1. Doris数据查询。

    use test_iceberg_db
    mysql> show tables;
    +----------------------------+
    | Tables_in_test_iceberg_db |
    +----------------------------+
    | test_table                 |
    +----------------------------+
    1 row in set (0.00 sec)
    
    mysql> select * from test_table;
    +------+------+
    | id   | data |
    +------+------+
    |    1 | a    |
    |    2 | b    |
    |    3 | c    |
    +------+------+
    3 rows in set (0.40 sec)
    
    
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值