Big Data Hive Notes

Apache Hive




1. Overview

Data warehouse (English: Data Warehouse, abbreviated DW or DWH): a strategic collection of data of all types that supports the decision-making processes at every level of an enterprise. It is a single data store, created for analytical reporting and decision support. For enterprises that need business intelligence, it guides business-process improvement and the monitoring of time, cost, quality, and control.


Apache Hive is a data warehouse tool built on Hadoop. It maps structured data files onto database tables and offers simple SQL-like querying, translating SQL statements into MapReduce jobs for execution. Its strengths are a low learning curve and the ability to express simple MapReduce-style statistics through SQL-like statements without writing a dedicated MapReduce application, which makes it well suited to statistical analysis over a data warehouse.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL), a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive defines a simple SQL-like query language called HQL that lets users familiar with SQL query the data. The same language also lets developers familiar with MapReduce plug in custom mappers and reducers for complex analyses that the built-in ones cannot handle. Hive has no dedicated data format: it works well on top of Thrift, controls the delimiters, and lets users specify the data format.

Note

ETL is a standard term in big data: E = Extract, T = Transform, L = Load.

ETL refers to the processing pipeline from a data source to the data warehouse. E: extract the relevant data from the source according to a set of rules; T: apply simple format conversions and store the result in a staging table in the warehouse; L: load the staged data into the warehouse's business tables according to business requirements.
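As a rough HiveQL illustration of the three steps, assuming hypothetical table names (raw_logs, stg_logs, dw_logs) and a space-separated log source:

<code>-- E: extract -- expose the raw source files through an external table
CREATE EXTERNAL TABLE raw_logs(line string) LOCATION '/data/raw/logs';

-- T: transform -- parse and reshape the rows into a staging table
CREATE TABLE stg_logs(ip string, ts string);
INSERT OVERWRITE TABLE stg_logs
SELECT split(line, ' ')[0], split(line, ' ')[3] FROM raw_logs;

-- L: load -- move the cleaned rows into the warehouse business table
CREATE TABLE dw_logs(ip string, ts string);
INSERT INTO TABLE dw_logs SELECT ip, ts FROM stg_logs;</code>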

Use Cases

Hive is built on Hadoop, a static batch-processing system, and Hadoop typically has high latency and considerable overhead for job submission and scheduling. As a result, Hive cannot deliver low-latency queries over large datasets; even on a dataset of a few hundred MB, a query generally takes minutes. Hive is therefore unsuited to applications that need low latency, such as online transaction processing (OLTP). Query execution strictly follows the Hadoop MapReduce job model: Hive translates the user's HiveQL statements into MapReduce jobs via its interpreter, submits them to the Hadoop cluster, and Hadoop monitors execution and returns the results to the user. Hive was not designed for online transaction processing and provides neither real-time queries nor row-level updates. Its best use case is batch processing over large datasets, for example web log analysis.

Features

Hive is a data warehouse processing tool that wraps Hadoop underneath and uses the SQL-like HiveQL language for queries. All Hive data lives in a Hadoop-compatible file system such as HDFS. When loading data, Hive does not modify it in any way; it merely moves the files into the table's directory in HDFS. Consequently, Hive does not support rewriting or appending to existing rows: everything about the data is fixed at load time. Its features:

  • Supports indexes to speed up queries.
  • Supports different storage backends, e.g. plain text files or files in HBase.
  • Keeps metadata in a relational database, greatly reducing the time spent on semantic checks during queries.
  • Can directly use data already stored in the Hadoop file system.
  • Ships a large set of built-in user functions (UDFs) for time, strings, and other data mining needs, and supports user-defined UDFs for operations the built-ins cannot express.
  • SQL-like querying: SQL queries are translated into MapReduce jobs that run on the Hadoop cluster.

Data Types

First, Hive has no dedicated storage format and builds no indexes over the data; users can organize Hive tables quite freely, and only need to tell Hive the column and row delimiters when creating a table for it to parse the data. Second, all Hive data is stored in HDFS. Hive includes the following data models: tables (Table, also called managed/internal tables), external tables (External Table), partitions (Partition), and buckets (Bucket).


2. Environment Setup

Prerequisites

  • MySQL DB (Hive keeps its metadata in a relational database to cut down on semantic-check queries; remote access must be enabled). Here the MySQL service on the Windows host is used (installation omitted):
    mysql -u root -proot
    use mysql;
    update user set host = '%' where host = 'localhost';
    GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;
    FLUSH PRIVILEGES;
    exit;
  • Hadoop (HDFS & YARN cluster) services healthy:
    [root@HadoopNode00 ~]# start-dfs.sh
    Starting namenodes on [HadoopNode00]
    HadoopNode00: starting namenode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-root-namenode-HadoopNode00.out
    localhost: starting datanode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-root-datanode-HadoopNode00.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-HadoopNode00.out
    [root@HadoopNode00 ~]#
    [root@HadoopNode00 ~]#
    [root@HadoopNode00 ~]# start-yarn.sh
    starting yarn daemons
    starting resourcemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-root-resourcemanager-HadoopNode00.out
    localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-root-nodemanager-HadoopNode00.out
    [root@HadoopNode00 ~]# jps
    1858 DataNode
    1765 NameNode
    2618 Jps
    2204 ResourceManager
    2046 SecondaryNameNode
    2302 NodeManager
  • JDK 1.8 or above

Installation

Upload the installation package

Extract and install

<code>[root@HadoopNode00 ~]# tar -zxf apache-hive-1.2.1-bin.tar.gz -C /usr/</code>

Configuration

Create hive-site.xml

<code>[root@HadoopNode00 conf]# vi hive-site.xml

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.197.1:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>1234</value>
  </property>
</configuration></code>

Note:

The hive metastore database in MySQL must be created with a Latin (latin1) character set.
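For example, assuming the metastore database is named hive, matching the JDBC URL above, it can be created in MySQL like this:

<code>CREATE DATABASE hive CHARACTER SET latin1;</code>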

Add the MySQL driver JAR

Make sure the driver version matches the MySQL server.

<code>[root@HadoopNode00 apache-hive-1.2.1-bin]# mv /root/mysql-connector-java-5.1.6.jar /usr/apache-hive-1.2.1-bin/lib/</code>

Replace Hadoop's older jline JAR

<code>[root@HadoopNode00 ~]# cp /usr/apache-hive-1.2.1-bin/lib/jline-2.12.jar /home/hadoop/hadoop-2.6.0/share/hadoop/yarn/lib/
[root@HadoopNode00 ~]# rm -rf /home/hadoop/hadoop-2.6.0/share/hadoop/yarn/lib/jline-0.9.94.jar</code>


Starting the Hive Service

Single-user access

Start the Hive server and Hive client together in one terminal; only that Hive client can operate against the server.

<code>[root@HadoopNode00 ~]# cd /usr/apache-hive-1.2.1-bin/
[root@HadoopNode00 apache-hive-1.2.1-bin]# bin/hive
Logging initialized using configuration in jar:file:/usr/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 0.669 seconds, Fetched: 1 row(s)
hive> use default;
OK
Time taken: 0.028 seconds
hive> show tables;
OK
Time taken: 0.024 seconds</code>

Multi-user access

Start HiveServer2 first; multiple Hive clients can then connect from other terminals.

<code>[root@HadoopNode00 apache-hive-1.2.1-bin]# bin/hiveserver2

[root@HadoopNode00 apache-hive-1.2.1-bin]# bin/beeline -u jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1 by Apache Hive
0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
+----------------+--+
1 row selected (1.07 seconds)
0: jdbc:hive2://localhost:10000> use default;
No rows affected (0.052 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
+-----------+--+
No rows selected (0.037 seconds)
0: jdbc:hive2://localhost:10000></code>

Note:

When the Hive server starts for the first time, it creates 29 metadata tables in MySQL.

Hive also creates a warehouse directory in HDFS to hold table data.
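To verify, you can inspect the metastore database in MySQL (assuming it is named hive as configured above):

<code>use hive;
show tables;   -- lists the metadata tables, e.g. DBS, TBLS, COLUMNS_V2, ...</code>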


3. Database and Table Operations

Databases

Create a database

Full syntax

<code>CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];</code>

Example:

<code>Variant 1:
hive> create database if not exists baizhi;
OK
Time taken: 0.159 seconds

The database's storage directory is created automatically in HDFS: /user/hive/warehouse/baizhi.db

Variant 2:
hive> create database test3;
OK

Variant 3 (full form):
hive> create database if not exists test2 comment 'test2 database' location '/user/test2' with dbproperties('author'='gaozhy','company'='baizhiedu');
OK</code>

Drop a database

Full syntax

<code>DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];</code>

The default is RESTRICT: a database that still contains tables cannot be dropped.

CASCADE drops the database's tables along with the database.

Example:

<code>hive> drop schema if exists test3 restrict;
Moved: 'hdfs://HadoopNode00:9000/user/hive/warehouse/test3.db' to trash at: hdfs://HadoopNode00:9000/user/root/.Trash/Current
OK
Time taken: 0.178 seconds
hive> drop database test2 cascade;
Moved: 'hdfs://HadoopNode00:9000/user/test2' to trash at: hdfs://HadoopNode00:9000/user/root/.Trash/Current
OK
Time taken: 0.101 seconds</code>

Describe a database

Full syntax

<code>(DESC|DESCRIBE) (DATABASE|SCHEMA) database_name;</code>

Example:

<code>hive> desc database baizhi;
OK
baizhi hdfs://HadoopNode00:9000/user/hive/warehouse/baizhi.db root USER
Time taken: 0.037 seconds, Fetched: 1 row(s)</code>

Alter a database

Full syntax

<code>ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;</code>

Example:

<code>hive> desc database baizhi;
OK
baizhi hdfs://HadoopNode00:9000/user/hive/warehouse/baizhi.db zs USER
Time taken: 0.049 seconds, Fetched: 1 row(s)
hive> alter database baizhi set owner user root;
OK
Time taken: 0.026 seconds
hive> desc database baizhi;
OK
baizhi hdfs://HadoopNode00:9000/user/hive/warehouse/baizhi.db root USER
Time taken: 0.016 seconds, Fetched: 1 row(s)
</code>

Switch databases

Full syntax

<code>hive> select current_database();
OK
default
Time taken: 0.585 seconds, Fetched: 1 row(s)
hive> use baizhi;
OK
Time taken: 0.021 seconds
hive> select current_database();
OK
baizhi
</code>

List databases

Full syntax

<code>hive> show databases;</code>


Data Types in Hive Tables

Data types: primitive, array, map, struct

  • Primitive types: integers TINYINT, SMALLINT, INT, BIGINT (value ranges equivalent to Byte, Short, Int, Long); boolean BOOLEAN; floating point FLOAT, DOUBLE; character STRING, CHAR, VARCHAR; binary BINARY; date/time TIMESTAMP, DATE
  • Array type: ARRAY<data_type>
  • Map (key-value) type: MAP<primitive_type, data_type>
  • Struct type: STRUCT<col_name : data_type, ...>

Hive's default delimiters

  • \n: for text files, each line is one record, so \n separates records.
  • ^A (Ctrl+A): separates fields (columns); written as \001 in CREATE TABLE.
  • ^B (Ctrl+B): separates the elements of an array or struct, and separates the key-value pairs of a map; written as \002.
  • ^C (Ctrl+C): separates the key from the value within a map entry; written as \003.

To type these delimiters in vi (insert mode), press Ctrl+V followed by Ctrl+A/B/C.
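The defaults can also be spelled out explicitly; the following CREATE TABLE (the table name is hypothetical) is equivalent to relying on them:

<code>create table t_default_delim(id int, hobbies array<string>, tag map<string,string>)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n';</code>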


Working with Hive Tables

Table creation syntax

Standard syntax

Similar to CREATE TABLE in a relational database:

<code>hive> create table t_user(id int, name varchar(50),sex boolean,birthday date);
OK
Time taken: 0.161 seconds
hive> show tables;
OK
t_user
</code>

Loading data

<code># 1. Prepare a data file in the format the Hive table expects
1^Azs^Atrue^A2018-01-01
2^Als^Afalse^A1998-07-07

# 2. Load the file into the Hive table with the hive LOAD command [local file system]
hive> load data local inpath '/usr/apache-hive-1.2.1-bin/data/t_user.txt' into table t_user;
Loading data to table baizhi.t_user
Table baizhi.t_user stats: [numFiles=1, totalSize=43]
OK
Time taken: 0.299 seconds

# 3. Append-load a data file into the Hive table [HDFS]
hive > load data inpath 'hdfs://HadoopNode00:9000/t_user.txt' into table t_user;

Loading data to table baizhi.t_user
Table baizhi.t_user stats: [numFiles=2, totalSize=86]
OK
Time taken: 0.233 seconds
hive> select * from t_user;
OK
1 zs true 2018-01-01
2 ls false 1998-07-07
3 zs true 2018-01-01
4 ls false 1998-07-07

# 4. Overwrite-load a data file into the Hive table [HDFS]
hive> load data inpath 'hdfs://HadoopNode00:9000/t_user.txt' overwrite into table t_user;
Loading data to table baizhi.t_user
Moved: 'hdfs://HadoopNode00:9000/user/hive/warehouse/baizhi.db/t_user/t_user.txt' to trash at: hdfs://HadoopNode00:9000/user/root/.Trash/Current
Moved: 'hdfs://HadoopNode00:9000/user/hive/warehouse/baizhi.db/t_user/t_user_copy_1.txt' to trash at: hdfs://HadoopNode00:9000/user/root/.Trash/Current
Table baizhi.t_user stats: [numFiles=1, numRows=0, totalSize=43, rawDataSize=0]
OK
Time taken: 0.274 seconds</code>

Summary:

By default Hive creates a managed (internal) table; when data is loaded, the file is moved into the table's storage directory in the warehouse.

Data can be loaded either from the local file system (LOCAL) or from HDFS.

Loading appends by default; to replace the table's existing contents, specify OVERWRITE when loading.


Using the array type
<code># 1. Create the table
hive> create table t_person(id int,name string,hobbies array<string>);
OK
Time taken: 0.063 seconds



# 2. Prepare the data file
1^Azs^ATV^BLOL^BMUSIC
2^Als^ASPORT^BDrink

# 3. Load the data
hive> load data local inpath '/usr/apache-hive-1.2.1-bin/data/t_person.txt' into table t_person;
Loading data to table baizhi.t_person
Table baizhi.t_person stats: [numFiles=1, totalSize=35]
OK
Time taken: 0.197 seconds
hive> select * from t_person;
OK
1 zs ["TV","LOL","MUSIC"]
2 ls ["SPORT","Drink"]
Time taken: 0.053 seconds, Fetched: 2 row(s)
</code>
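Array columns can then be indexed and measured in queries, for example:

<code>hive> select name, hobbies[0], size(hobbies) from t_person;</code>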
Using the struct type
<code># 1. Create the table
hive> create table t_location(id tinyint,name string,address struct<country:string,city:string>);
OK
Time taken: 0.064 seconds

# 2. Prepare the data file
1^A三里屯^A中國^B北京朝陽
2^A五道口^A中國^B北京海淀

# 3. Load the data
hive> load data local inpath '/usr/apache-hive-1.2.1-bin/data/t_location.log' into table t_location;
Loading data to table baizhi.t_location
Table baizhi.t_location stats: [numFiles=1, totalSize=64]
OK
Time taken: 0.218 seconds
hive> select * from t_location;
OK
1 三里屯 {"country":"中國","city":"北京朝陽"}
2 五道口 {"country":"中國","city":"北京海淀"}
Time taken: 0.063 seconds, Fetched: 2 row(s)</code>

Note:

Struct-typed data is essentially organized and rendered as JSON.
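Individual struct fields are addressed with dot notation, for example:

<code>hive> select name, address.country, address.city from t_location;</code>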

Using the map type
<code># 1. Create the table
hive> create table t_product(id int,name varchar(50),tag map<string,string>);
OK
Time taken: 0.063 seconds


# 2. Prepare the data file
1^Aiphone11^Amemory^C256GB^Bsize^C5.8
2^Ahuawei mate30^Asize^C6.1

# 3. Load the data
hive> load data local inpath '/usr/apache-hive-1.2.1-bin/data/t_product.txt' into table t_product;
Loading data to table baizhi.t_product
Table baizhi.t_product stats: [numFiles=2, totalSize=107]
OK
Time taken: 0.194 seconds
hive> select * from t_product;
OK
1 iphone11 {"memory":"256GB"}
2 huawei mate30 {"size":"6.1"}
1 iphone11 {"memory":"256GB","size":"5.8"}
2 huawei mate30 {"size":"6.1"}
Time taken: 0.076 seconds, Fetched: 4 row(s)</code>
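Map values are looked up by key, for example:

<code>hive> select name, tag['size'] from t_product;</code>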


Custom Delimiters

Field delimiter
<code># 1. Use a custom field delimiter: a space
hive> create table tt_user(id int,name varchar(32),sex boolean,birth date) row format delimited fields terminated by ' ' lines terminated by '\n';
OK
Time taken: 0.123 seconds


# 2. Prepare the data file
1 zs true 2018-01-01
2 ls false 2020-01-02
3 ww false 2020-01-01

# 3. Load the data
hive> load data local inpath '/usr/apache-hive-1.2.1-bin/data/tt_user.txt' into table tt_user;
Loading data to table baizhi.tt_user
Table baizhi.tt_user stats: [numFiles=1, totalSize=65]
OK
Time taken: 0.228 seconds
hive> select * from tt_user;
OK
1 zs true 2018-01-01
2 ls false 2020-01-02
3 ww false 2020-01-01
Time taken: 0.05 seconds, Fetched: 3 row(s)</code>
Array delimiter
<code># 1. Custom delimiters: a space between fields, '>' between collection items
hive> create table t_order(id int,name varchar(32),num int,price double,tags array<string>,user_id int) row format delimited fields terminated by ' ' collection items terminated by '>' lines terminated by '\n';
OK
Time taken: 0.108 seconds

# 2. Prepare the data file
[root@HadoopNode00 data]# vi t_order.txt
1 iphone11 2 4999.0 貴>好用>香 101
2 huaweimate30 1 3999.0 國產>麒麟 102


# 3. Load the data
hive> load data local inpath '/usr/apache-hive-1.2.1-bin/data/t_order.txt' into table t_order;
Loading data to table baizhi.t_order
Table baizhi.t_order stats: [numFiles=1, totalSize=81]
OK
Time taken: 0.223 seconds
hive> select * from t_order;
OK
1 iphone11 2 4999.0 ["貴","好用","香"] 101
2 huaweimate30 1 3999.0 ["國產","麒麟"] 102
Time taken: 0.04 seconds, Fetched: 2 row(s)</code>
Map delimiter

map keys terminated by '<delimiter>'
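A complete definition that also sets the map-key delimiter might look like this (the table name is hypothetical):

<code>create table t_kv(id int, tag map<string,string>)
row format delimited
fields terminated by ' '
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';</code>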


Loading Data with a Regular Expression

<code># 1. Sample data
192.168.197.1 - - [20/Dec/2019:22:12:42 +0800] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
192.168.197.1 - - [20/Dec/2019:22:12:42 +0800] "GET /favicon.ico HTTP/1.1" 404 571 "http://hadoopnode00/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"

# 2. The regular expression
^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*\[(.*)\]\s"(\w+)\s(.*)\sHTTP\/1.1"\s(\d{3})\s.*$

# 3. In practice
hive> create table t_log(ip string,access_time string,method string,uri string,code smallint) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES("input.regex"="^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*\\[(.*)\\]\\s\"(\\w+)\\s(.*)\\sHTTP\\/1.1\"\\s(\\d{3})\\s.*$");
OK
Time taken: 0.085 seconds
hive> load data local inpath '/usr/apache-hive-1.2.1-bin/data/nginx.log' into table t_log;
Loading data to table baizhi.t_log
Table baizhi.t_log stats: [numFiles=1, totalSize=416]
OK
Time taken: 0.195 seconds
hive> select * from t_log;
OK
192.168.197.1 20/Dec/2019:22:12:42 +0800 GET / 200
192.168.197.1 20/Dec/2019:22:12:42 +0800 GET /favicon.ico 404
Time taken: 0.035 seconds, Fetched: 2 row(s)</code>


Loading Data from JSON Files

<code># 1. Prepare the JSON files
[root@HadoopNode00 json]# vi user1.json
{"id":1,"name":"zs","sex":true,"birthday":"1998-12-12"}
{"id":2,"name":"ls","sex":true,"birthday":"1990-12-12"}

[root@HadoopNode00 json]# vi user2.json
{"id":3,"name":"ww","sex":false,"birthday":"1995-07-08"}
{"id":4,"name":"zl","sex":false}


# 2. Create the Hive table
hive> create table t_user_json(id int,name varchar(32),sex boolean,birthday date)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
hive> ADD JAR /usr/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar ;
Added [/usr/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar] to class path

Added resources: [/usr/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar]
hive> create table t_user_json(id int,name varchar(32),sex boolean,birthday date)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
OK
Time taken: 0.138 seconds


# 3. Load the data
hive> load data local inpath '/usr/apache-hive-1.2.1-bin/data/json' overwrite into table t_user_json;
Loading data to table baizhi.t_user_json
Moved: 'hdfs://HadoopNode00:9000/user/hive/warehouse/baizhi.db/t_user_json/user1.json' to trash at: hdfs://HadoopNode00:9000/user/root/.Trash/Current
Moved: 'hdfs://HadoopNode00:9000/user/hive/warehouse/baizhi.db/t_user_json/user2.json' to trash at: hdfs://HadoopNode00:9000/user/root/.Trash/Current
Table baizhi.t_user_json stats: [numFiles=2, numRows=0, totalSize=202, rawDataSize=0]
OK
Time taken: 0.239 seconds
hive> select * from t_user_json;
OK
1 zs true 1998-12-12
2 ls true 1990-12-12
3 ww false 1995-07-08
4 zl false NULL</code>


4. Hive Table Types




Hive tables fall into managed (internal) tables, external tables, partitioned tables, bucketed tables, and temporary tables (whose lifetime is tied to the session: a Hive client can create a temporary table, and it is dropped automatically when the session ends).

Dropping tables

<code>DROP TABLE [IF EXISTS] table_name [PURGE];</code>

The PURGE keyword is optional.

With PURGE, the table's metadata and its data are both deleted outright.

Without it, only the table's metadata is deleted immediately; the table's data is moved to the HDFS trash directory (.Trash/Current).

Managed (internal) tables

A managed table controls the lifecycle of its data, so it is not suitable for sharing data across multiple teams for analysis.

<code>0: jdbc:hive2://localhost:10000> drop table t_location;
No rows affected (0.885 seconds)
0: jdbc:hive2://localhost:10000> drop table t_user_json;
No rows affected (0.15 seconds)</code>

External tables

<code># 1. Syntax for creating an external table
ADD JAR /usr/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar ;
0: jdbc:hive2://localhost:10000> create external table t_user_json(id int,name varchar(32),sex boolean,birthday date)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
No rows affected (0.294 seconds)

# 2. Load data

0: jdbc:hive2://localhost:10000> load data local inpath '/usr/apache-hive-1.2.1-bin/data/json' into table t_user_json;
INFO : Loading data to table baizhi.t_user_json from file:/usr/apache-hive-1.2.1-bin/data/json
INFO : Table baizhi.t_user_json stats: [numFiles=2, totalSize=202]
No rows affected (0.543 seconds)
0: jdbc:hive2://localhost:10000> drop table t_user_json purge;
No rows affected (0.139 seconds)</code>

Note:

Dropping an external table removes only the table's metadata; the data the external table points to is left untouched.
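Whether a table is managed or external can be checked with DESCRIBE FORMATTED; look for Table Type: EXTERNAL_TABLE (versus MANAGED_TABLE) in the output:

<code>0: jdbc:hive2://localhost:10000> describe formatted t_user_json;</code>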

Temporary tables

Temporary table keyword: temporary

Its lifetime is tied to the session.

<code>0: jdbc:hive2://localhost:10000> create temporary table ttt_user(id int,name string);
No rows affected (0.132 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+------------+--+
| tab_name |
+------------+--+
| t_log |
| t_order |
| t_person |
| t_product |
| t_user |
| tt_user |
| ttt_user |
+------------+--+
7 rows selected (0.492 seconds)</code>

Partitioned tables

Either an external or a managed table can declare partitions at creation time, which makes it a partitioned table. Partitioning applies the data-sharding idea: a large dataset is divided by some rule into several smaller ones, which yields much better performance when loading or processing the data. It is an optimization strategy.

<code># 1. Create a partitioned table
0: jdbc:hive2://localhost:10000> create table ttt_user(id int,name varchar(32),sex boolean,birth date) partitioned by(country String,state String) row format delimited fields terminated by ' ' lines terminated by '\n';
No rows affected (0.087 seconds)

# 2. Prepare the data
1 zs true 2020-01-01
2 ls false 1990-01-01
3 ww false 2001-01-01

# 3. Load the data
0: jdbc:hive2://localhost:10000> load data local inpath '/usr/apache-hive-1.2.1-bin/data/ttt_user.txt' into table ttt_user partition(country='china',state='sh');


0: jdbc:hive2://localhost:10000> load data local inpath '/usr/apache-hive-1.2.1-bin/data/ttt_user.txt' into table ttt_user partition(country='china',state='bj');


# 4. Querying a partitioned table
0: jdbc:hive2://localhost:10000> select * from ttt_user where country='china' and state='bj';
+--------------+----------------+---------------+-----------------+-------------------+-----------------+--+
| ttt_user.id | ttt_user.name | ttt_user.sex | ttt_user.birth | ttt_user.country | ttt_user.state |
+--------------+----------------+---------------+-----------------+-------------------+-----------------+--+
| 1 | zs | true | 2020-01-01 | china | bj |
| 2 | ls | false | 1990-01-01 | china | bj |
| 3 | ww | false | 2001-01-01 | china | bj |
+--------------+----------------+---------------+-----------------+-------------------+-----------------+--+
</code>

Partitioned tables:

This is a Hive optimization: when querying by partition, only the data inside the matching partitions needs to be loaded, rather than the whole table.

Data is loaded by specifying the partition pseudo-columns together with their values.
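One way to observe the partition pruning described above is EXPLAIN (a sketch; the plan output varies by version):

<code>0: jdbc:hive2://localhost:10000> explain select * from ttt_user where country='china' and state='bj';</code>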

Bucketed tables

Bucketing is a technique that decomposes a dataset into a number of parts that are easier to organize and manage. It helps with data skew, with joins between two large tables, and with efficient data sampling.

<code># 1. Create a bucketed table
0: jdbc:hive2://localhost:10000> create table t_bucket(id int,name string) clustered by (id) into 3 buckets;
No rows affected (0.141 seconds)


# 2. Note: a bucketed table cannot be populated with LOAD
# 3. Required settings
# enforce bucketing
set hive.enforce.bucketing = true;
# set the number of reducer tasks equal to the number of buckets
set mapred.reduce.tasks = 3;

# 4. First load the data into a temporary table
create temporary table t_bucket_tmp(id int,name string);
load data local inpath '/usr/apache-hive-1.2.1-bin/data/bucketTmp.txt' into table t_bucket_tmp;

# 5. Copy the data from the temporary table into the bucketed table
insert into t_bucket select * from t_bucket_tmp cluster by id;</code>
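Once bucketed, the table supports efficient sampling with TABLESAMPLE, e.g. reading one of the three buckets:

<code>0: jdbc:hive2://localhost:10000> select * from t_bucket tablesample(bucket 1 out of 3 on id);</code>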


Other partition operations

<code>0: jdbc:hive2://localhost:10000> alter table ttt_user drop partition(country='china',state='sh');
INFO : Dropped the partition country=china/state=sh
No rows affected (0.224 seconds)
0: jdbc:hive2://localhost:10000> alter table ttt_user add partition(country='china',state='sh');
No rows affected (0.167 seconds)
0: jdbc:hive2://localhost:10000> show partitions ttt_user;
+-------------------------+--+
| partition |
+-------------------------+--+
| country=china/state=bj |
| country=china/state=sh |
+-------------------------+--+
2 rows selected (0.113 seconds)</code>

Truncating tables

<code>0: jdbc:hive2://localhost:10000> select * from  t_user; 

+------------+--------------+-------------+------------------+--+
| t_user.id | t_user.name | t_user.sex | t_user.birthday |
+------------+--------------+-------------+------------------+--+
| 3 | zs | true | 2018-01-01 |
| 4 | ls | false | 1998-07-07 |
+------------+--------------+-------------+------------------+--+
2 rows selected (0.134 seconds)
0: jdbc:hive2://localhost:10000> truncate table t_user;
No rows affected (0.107 seconds)
0: jdbc:hive2://localhost:10000> select * from t_user;
+------------+--------------+-------------+------------------+--+
| t_user.id | t_user.name | t_user.sex | t_user.birthday |
+------------+--------------+-------------+------------------+--+
+------------+--------------+-------------+------------------+--+
</code>


5. Hive on JDBC

Import the Hive JDBC driver

<code><dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.6.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>1.1.0</version>
</dependency></code>

Hive driver class

<code>org.apache.hive.jdbc.HiveDriver</code>

Application

<code>package com.baizhi;

import java.sql.*;

public class HiveOnJdbc {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // register the Hive JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // connect to HiveServer2, database baizhi
        Connection connection = DriverManager.getConnection("jdbc:hive2://HadoopNode00:10000/baizhi");

        // query a partitioned table by its partition pseudo-columns
        String sql = "select * from ttt_user where country=? and state=?";

        PreparedStatement pstm = connection.prepareStatement(sql);
        pstm.setString(1, "china");
        pstm.setString(2, "bj");

        ResultSet resultSet = pstm.executeQuery();

        while (resultSet.next()) {
            int id = resultSet.getInt("id");
            String name = resultSet.getString(2);
            Boolean sex = resultSet.getBoolean("sex");
            Date birth = resultSet.getDate("birth");
            System.out.println(id + "\t" + name + "\t" + sex + "\t" + birth);
        }

        resultSet.close();
        pstm.close();
        connection.close();
    }
}</code>

6. Hive SQL Operations

Review

<code>Relational DB query syntax:

select <column list> from <table> where <filter> group by <grouping columns> having <post-group filter> order by <sort columns> asc | desc limit <max rows>;</code>

Full Hive SQL syntax

<code>SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list] # globally ordered result (only one reducer)
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list asc|desc]] # partition key: id.hashCode % numReduceTasks
[LIMIT number]</code>

Note:

order by col_list asc|desc: global sort, with only one reducer task;

DISTRIBUTE BY col_list: the partition key used during the shuffle; rows are partitioned by the value of the named columns

SORT BY col_list: sorts rows within each partition (local sort)

CLUSTER BY col_list: shorthand for DISTRIBUTE BY col_list plus SORT BY col_list on the same columns;

<code># 1. Group, then filter the groups
0: jdbc:hive2://localhost:10000> select sex,count(sex) from ttt_user where country='china' and state='bj' group by sex having sex= false;

# 2. Group, then sort the result set globally
0: jdbc:hive2://localhost:10000> select sex,count(sex) as num from ttt_user where country='china' and state='bj' group by sex order by num desc;

# 3. Group + cluster by
0: jdbc:hive2://localhost:10000> select sex,count(sex) as num from ttt_user where country='china' and state='bj' group by sex cluster by sex;

# 4. Group + distribute by + sort by
0: jdbc:hive2://localhost:10000> select sex,count(sex) as num from ttt_user where country='china' and state='bj' group by sex distribute by sex sort by sex desc;

# 5. Using limit
0: jdbc:hive2://localhost:10000> select sex,count(sex) as num from ttt_user where country='china' and state='bj' group by sex distribute by sex sort by sex desc limit 1;</code>

Join Queries

Inner join ([inner] join)

Rows from the left and right tables that satisfy the join condition are combined into a single wide result.

<code># Employee data
1,zs,true,18,A
2,ls,false,20,B
3,ww,false,25,A
4,zl,false,30,B
5,tq,true,21,C

# Department data
A,研發部
B,市場部
C,銷售部
D,後勤部

0: jdbc:hive2://localhost:10000> create table t_employee(id int,name varchar(32),sex boolean,age tinyint,dept string) row format delimited fields terminated by ',' lines terminated by '\n';
No rows affected (0.11 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/usr/apache-hive-1.2.1-bin/data/employee.txt' into table t_employee;
INFO : Loading data to table baizhi.t_employee from file:/usr/apache-hive-1.2.1-bin/data/employee.txt
INFO : Table baizhi.t_employee stats: [numFiles=1, totalSize=78]
No rows affected (0.286 seconds)
0: jdbc:hive2://localhost:10000> select * from t_employee;
+----------------+------------------+-----------------+-----------------+------------------+--+
| t_employee.id | t_employee.name | t_employee.sex | t_employee.age | t_employee.dept |
+----------------+------------------+-----------------+-----------------+------------------+--+
| 1 | zs | true | 18 | A |
| 2 | ls | false | 20 | B |
| 3 | ww | false | 25 | A |
| 4 | zl | false | 30 | B |
| 5 | tq | true | 21 | C |
+----------------+------------------+-----------------+-----------------+------------------+--+


0: jdbc:hive2://localhost:10000> create table t_dept(deptId string,name string) row format delimited fields terminated by ',' lines terminated by '\n';
No rows affected (0.094 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/usr/apache-hive-1.2.1-bin/data/dept.txt' into table t_dept;
INFO : Loading data to table baizhi.t_dept from file:/usr/apache-hive-1.2.1-bin/data/dept.txt
INFO : Table baizhi.t_dept stats: [numFiles=1, totalSize=48]
No rows affected (0.253 seconds)
0: jdbc:hive2://localhost:10000> select * from t_dept;
+----------------+--------------+--+
| t_dept.deptid | t_dept.name |
+----------------+--------------+--+
| A | 研發部 |
| B | 市場部 |
| C | 銷售部 |
| D | 後勤部 |
+----------------+--------------+--+


0: jdbc:hive2://localhost:10000> select * from t_employee t1 inner join t_dept t2 on t1.dept = t2.deptId;
+--------+----------+---------+---------+----------+------------+----------+--+
| t1.id | t1.name | t1.sex | t1.age | t1.dept | t2.deptid | t2.name |
+--------+----------+---------+---------+----------+------------+----------+--+
| 1 | zs | true | 18 | A | A | 研發部 |
| 2 | ls | false | 20 | B | B | 市場部 |
| 3 | ww | false | 25 | A | A | 研發部 |
| 4 | zl | false | 30 | B | B | 市場部 |
| 5 | tq | true | 21 | C | C | 銷售部 |
+--------+----------+---------+---------+----------+------------+----------+--+</code>


Outer joins (left | right outer join)

<code>0: jdbc:hive2://localhost:10000> select * from t_employee t1 left outer join t_dept t2 on t1.dept = t2.deptId;
+--------+----------+---------+---------+----------+------------+----------+--+
| t1.id | t1.name | t1.sex | t1.age | t1.dept | t2.deptid | t2.name |
+--------+----------+---------+---------+----------+------------+----------+--+
| 1 | zs | true | 18 | A | A | 研發部 |
| 2 | ls | false | 20 | B | B | 市場部 |
| 3 | ww | false | 25 | A | A | 研發部 |
| 4 | zl | false | 30 | B | B | 市場部 |
| 5 | tq | true | 21 | C | C | 銷售部 |
+--------+----------+---------+---------+----------+------------+----------+--+



0: jdbc:hive2://localhost:10000> select * from t_employee t1 right outer join t_dept t2 on t1.dept = t2.deptId;
+--------+----------+---------+---------+----------+------------+----------+--+
| t1.id | t1.name | t1.sex | t1.age | t1.dept | t2.deptid | t2.name |
+--------+----------+---------+---------+----------+------------+----------+--+
| 1 | zs | true | 18 | A | A | 研發部 |
| 3 | ww | false | 25 | A | A | 研發部 |
| 2 | ls | false | 20 | B | B | 市場部 |
| 4 | zl | false | 30 | B | B | 市場部 |
| 5 | tq | true | 21 | C | C | 銷售部 |
| NULL | NULL | NULL | NULL | NULL | D | 後勤部 |
+--------+----------+---------+---------+----------+------------+----------+--+
</code>


Left semi join

A left semi join returns rows from the left table, provided they satisfy the ON condition against the right table.

<code>0: jdbc:hive2://localhost:10000> select * from t_employee t1 left semi join t_dept t2 on t1.dept = t2.deptId;
INFO : Execution completed successfully

INFO : MapredLocal task succeeded
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : number of splits:1
INFO : Submitting tokens for job: job_1577964101376_0017
INFO : The url to track the job: http://HadoopNode00:8088/proxy/application_1577964101376_0017/
INFO : Starting Job = job_1577964101376_0017, Tracking URL = http://HadoopNode00:8088/proxy/application_1577964101376_0017/
INFO : Kill Command = /home/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1577964101376_0017
INFO : Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
INFO : 2020-01-03 23:02:56,491 Stage-3 map = 0%, reduce = 0%
INFO : 2020-01-03 23:03:02,696 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.83 sec
INFO : MapReduce Total cumulative CPU time: 2 seconds 830 msec
INFO : Ended Job = job_1577964101376_0017
+--------+----------+---------+---------+----------+--+
| t1.id | t1.name | t1.sex | t1.age | t1.dept |
+--------+----------+---------+---------+----------+--+
| 1 | zs | true | 18 | A |
| 2 | ls | false | 20 | B |
| 3 | ww | false | 25 | A |
| 4 | zl | false | 30 | B |
| 5 | tq | true | 21 | C |
+--------+----------+---------+---------+----------+--+</code>
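A LEFT SEMI JOIN is Hive's way of expressing an IN/EXISTS-style subquery; the query above behaves like this sketch:

<code>0: jdbc:hive2://localhost:10000> select * from t_employee t1 where t1.dept in (select deptId from t_dept);</code>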


map-side join

A map-side join performs the join on the map side; it is Hive's optimization for joining a small table with a large table.

Note:

  • Map-side joins apply only to inner joins and left outer joins;
  • Before Hive 0.7, use a hint: select /*+mapjoin(small_table_alias)*/ .....
  • From Hive 0.7 on, set hive.auto.convert.join=true; enables automatic join optimization; the small table must appear before the join keyword, because Hive joins from left to right;
<code>0: jdbc:hive2://localhost:10000> select /*+mapjoin(t2)*/ * from t_employee t1 left outer join t_dept t2 on t1.dept = t2.deptId;


0: jdbc:hive2://localhost:10000> set hive.auto.convert.join=true;

0: jdbc:hive2://localhost:10000> select * from t_dept t2 left outer join t_employee t1 on t1.dept = t2.deptId;</code>


Full Outer Join

A full outer join connects the rows of the left and right tables that satisfy the condition, and also keeps the unmatched rows from both sides.

Cartesian product (cross) join

Every row of the left table is paired with every row of the right table: joining a 5-row left table with a 6-row right table produces 30 records.
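Written in HiveQL against the tables above (a sketch):

<code>0: jdbc:hive2://localhost:10000> select * from t_employee t1 full outer join t_dept t2 on t1.dept = t2.deptId;

0: jdbc:hive2://localhost:10000> select * from t_employee t1 cross join t_dept t2;</code>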


7. Hive and HBase Integration

Requirements

  • HDFS
  • ZooKeeper
  • HBase cluster running normally

Prepare the HBase table

<code>hbase(main):002:0> create 'baizhi2:t_user','cf1'
0 row(s) in 2.4760 seconds
hbase(main):001:0> put 'baizhi2:t_user','user101','cf1:name','zs'
0 row(s) in 0.3800 seconds

hbase(main):002:0> put 'baizhi2:t_user','user101','cf1:age',18
0 row(s) in 0.0180 seconds

hbase(main):003:0> put 'baizhi2:t_user','user102','cf1:name','ls'
0 row(s) in 0.0060 seconds

hbase(main):004:0> put 'baizhi2:t_user','user102','cf1:age',20
0 row(s) in 0.0180 seconds

hbase(main):005:0> scan 'baizhi2:t_user'
ROW COLUMN+CELL
user101 column=cf1:age, timestamp=1578068239429, value=18
user101 column=cf1:name, timestamp=1578068227481, value=zs
user102 column=cf1:age, timestamp=1578068289077, value=20
user102 column=cf1:name, timestamp=1578068278698, value=ls
2 row(s) in 0.0420 seconds</code>

Create a Hive table mapped to HBase

<code>create external table t_hbase_user(id string,name string,age int) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties('hbase.columns.mapping'=':key,cf1:name,cf1:age') tblproperties('hbase.table.name'='baizhi2:t_user');


0: jdbc:hive2://localhost:10000> select * from t_hbase_user;
+------------------+--------------------+-------------------+--+
| t_hbase_user.id | t_hbase_user.name | t_hbase_user.age |
+------------------+--------------------+-------------------+--+
| user101 | zs | 18 |
| user102 | ls | 20 |
+------------------+--------------------+-------------------+--+
2 rows selected (1.142 seconds)</code>
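Because the table is backed by the HBaseStorageHandler, rows inserted through Hive land in HBase as well; a quick sketch (the user103 row and its values are hypothetical):

<code>0: jdbc:hive2://localhost:10000> insert into table t_hbase_user select 'user103', 'ww', 25 from t_hbase_user limit 1;</code>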

