Iceberg6: Iceberg+Spark


Prerequisites

Hadoop 3.3.2
Spark 3.3.0
Scala 2.13.8

Configure Spark

  1. Download the jar package
  2. Place the downloaded iceberg-spark-runtime-3.3_2.12-0.14.0.jar into the /spark/jars directory (see the example commands below)
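A minimal sketch of steps 1 and 2, assuming the jar is fetched from Maven Central and that /spark refers to /usr/local/spark as in the shell prompts used throughout this post:
# Download the Iceberg Spark runtime jar (Maven Central URL assumed)
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/0.14.0/iceberg-spark-runtime-3.3_2.12-0.14.0.jar
# Copy it into Spark's jars directory
cp iceberg-spark-runtime-3.3_2.12-0.14.0.jar /usr/local/spark/jars/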
  3. Configure the /spark/conf/spark-defaults.conf file
root@redis01:/usr/local/spark/conf# vim spark-defaults.conf
spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.type = hadoop
spark.sql.catalog.hadoop_prod.warehouse = hdfs://redis01:8020/spark/warehouse
spark.sql.catalog.catalog-name.type = hadoop
spark.sql.catalog.catalog-name.default-namespace = db
spark.sql.catalog.catalog-name.warehouse = hdfs://redis01:8020/spark/warehouse
Distribute the iceberg-spark-runtime-3.3_2.12-0.14.0.jar and the updated spark-defaults.conf to the other nodes:
root@redis01:/usr/local/spark/jars# xsync iceberg-spark-runtime-3.3_2.12-0.14.0.jar
root@redis01:/usr/local/spark/conf# xsync spark-defaults.conf

Start

root@redis01:/usr/local/spark/bin# ./spark-sql
Check the databases
# show databases will not list the hadoop_prod catalog here; just use it directly
spark-sql> show databases;
# Switch to hadoop_prod
spark-sql> use hadoop_prod;
# Create a database
spark-sql> create database db;
# Use db
spark-sql> use db;
# Create a table
create table testA(
  id bigint,
  name string,
  age int,
  dt string)
USING iceberg
PARTITIONED by(dt);
 
# Insert data
insert into testA values(1,"henggao",18,'2022-09-01');
# Query
select * from testA;
 

Check HDFS

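To check the result from the command line, list the warehouse path configured in spark-defaults.conf; the db/testA subdirectory layout below is an assumption based on the database and table names created above:
# List the data and metadata files written for testA (path layout assumed)
hdfs dfs -ls -R /spark/warehouse/db/testA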

Other operations

overwrite operation

When an overwrite is executed against a table, the existing data is rewritten, just as in Hive. Note that it is not the whole table that gets overwritten, only the data in the affected partition. The overwritten data is not physically deleted; it still exists on HDFS unless the table is dropped.
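A minimal sketch of such an overwrite against the testA table from above (the row values are made up, and the dynamic partition-overwrite setting is an assumption: with Spark's default static mode and no PARTITION clause, the whole table would be replaced rather than a single partition):
# Only overwrite the partitions touched by the new rows
set spark.sql.sources.partitionOverwriteMode=dynamic;
# Rewrites the dt='2022-09-01' partition; the old data files remain on HDFS until the table is dropped
insert overwrite testA values(2,"lisi",20,'2022-09-01');
select * from testA;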
 
Reference: Iceberg(二)对接Spark_Yuan_CSDF的博客-CSDN博客_iceberg spark