03.03 Basic Hive Operations (Part 2)

3.1.6 Hive Transaction Operations

Hive has supported transactions since version 0.14. Files in HDFS can only be created and deleted; modifying individual records inside a file is not supported. Because Hive stores its data as HDFS files, it cannot, in principle, apply transactional updates to file contents in place and has to take a different approach: the HDFS base files hold the original data, and delta files act as a log of the changes. When Hive data is read, the base files and the delta files are merged so that the query sees the latest data.

In short, Hive supports transactions on the premise that the initial data and the incremental operations are stored separately. The consequence of this storage scheme is that the delta files kept by Hive grow in number over time, which increases the management burden on the NameNode: the NameNode is designed to manage a small number of large files, and every HDFS directory and file adds to its memory consumption.

Prerequisites for enabling transactions in Hive: (1) Configure the six properties listed below. (2) Add the tables required for transaction support to the Hive metastore (whether this step has since been simplified is unclear; in theory it is rather involved). (3) Restart the Hive service. (4) Tables must be created with ORC storage. (5) The database must provide ACID[1] guarantees; otherwise data correctness cannot be ensured during a transaction and the outcome may well fail to meet the requirements of the transacting parties.

The following settings enable transaction support:

SET hive.support.concurrency = true;

SET hive.enforce.bucketing = true;

SET hive.exec.dynamic.partition.mode = nonstrict;

SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

SET hive.compactor.initiator.on = true;

SET hive.compactor.worker.threads = 1;

With the transaction properties configured, we can look at operations on transactional tables. First create a bucketed table (a transactional table, stored as an optimized columnar ORC file). After the bucketed table is created, use the desc command to inspect its structure; the command and its output are shown below.
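A table definition consistent with the desc formatted output that follows (two buckets on id, ORC storage, tab delimiters, and the transactional property set to true) might look like this sketch:

create table tx(id int, name string)
clustered by (id) into 2 buckets
row format delimited fields terminated by '\t'
stored as orc
tblproperties('transactional'='true');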

hive> desc formatted tx;

OK

# col_name data_type comment

id int

name string

# Detailed Table Information

Database: default

Owner: laura

CreateTime: Wed Aug 02 18:31:52 PDT 2017

LastAccessTime: UNKNOWN

Retention: 0

Location: hdfs://hadoop0:8020/user/hive/warehouse/tx

Table Type: MANAGED_TABLE

Table Parameters:

COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}

numFiles 0

numRows 0

rawDataSize 0

totalSize 0

transactional true

transient_lastDdlTime 1501723912

# Storage Information

SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde

InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

Compressed: No

Num Buckets: 2

Bucket Columns: [id]

Sort Columns: []

Storage Desc Params:

field.delim \t

line.delim \n

serialization.format \t

Time taken: 0.418 seconds, Fetched: 34 row(s)

Before inserting data into a transactional table, all of the transaction settings introduced above must be enabled. Then some rows can be inserted into the table:

hive> insert into tx(id,name) values(1,'tom');

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = laura_20170802183559_92c64423-63aa-4a82-abbd-6d24039f1ceb

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 2

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1501679648039_0003, Tracking URL = http://hadoop0:8888/proxy/application_1501679648039_0003/

Kill Command = /home/laura/hadoop-2.7.3/bin/hadoop job -kill job_1501679648039_0003

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2

2017-08-02 18:36:37,024 Stage-1 map = 0%, reduce = 0%

2017-08-02 18:36:59,492 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.5 sec

2017-08-02 18:37:15,802 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 8.94 sec

2017-08-02 18:37:23,286 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.26 sec

MapReduce Total cumulative CPU time: 13 seconds 260 msec

Ended Job = job_1501679648039_0003

Loading data to table default.tx

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 13.26 sec HDFS Read: 11762 HDFS Write: 931 SUCCESS

Total MapReduce CPU Time Spent: 13 seconds 260 msec

OK

Time taken: 88.482 seconds

hive> insert into tx(id,name) values(2,'tomlee');

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = laura_20170802183855_89b897ca-a3d8-4be3-aa7e-50da1ae39ce3

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 2

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1501679648039_0004, Tracking URL = http://hadoop0:8888/proxy/application_1501679648039_0004/

Kill Command = /home/laura/hadoop-2.7.3/bin/hadoop job -kill job_1501679648039_0004

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2

2017-08-02 18:39:13,830 Stage-1 map = 0%, reduce = 0%

2017-08-02 18:39:24,580 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.52 sec

2017-08-02 18:39:38,346 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 6.93 sec

2017-08-02 18:39:46,007 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.02 sec

MapReduce Total cumulative CPU time: 11 seconds 20 msec

Ended Job = job_1501679648039_0004

Loading data to table default.tx

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 11.02 sec HDFS Read: 11667 HDFS Write: 945 SUCCESS

Total MapReduce CPU Time Spent: 11 seconds 20 msec

OK

Time taken: 52.174 seconds

hive> insert into tx(id,name) values(3,'tomaslee');

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = laura_20170802184117_d8d75ebf-37b6-4623-b464-e6348eddae36

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 2

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1501679648039_0005, Tracking URL = http://hadoop0:8888/proxy/application_1501679648039_0005/

Kill Command = /home/laura/hadoop-2.7.3/bin/hadoop job -kill job_1501679648039_0005

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2

2017-08-02 18:41:33,517 Stage-1 map = 0%, reduce = 0%

2017-08-02 18:41:55,639 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.0 sec

2017-08-02 18:42:08,404 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 11.91 sec

2017-08-02 18:42:10,490 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.66 sec

MapReduce Total cumulative CPU time: 15 seconds 660 msec

Ended Job = job_1501679648039_0005

Loading data to table default.tx

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 15.66 sec HDFS Read: 11677 HDFS Write: 961 SUCCESS

Total MapReduce CPU Time Spent: 15 seconds 660 msec

OK

Time taken: 54.029 seconds
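After these three inserts, each transaction leaves its own delta directory under the table's HDFS location, which is exactly the base-plus-delta layout described at the start of this section. The directory names depend on the transaction ids that were assigned, so the listing below is only an approximation:

hdfs dfs -ls /user/hive/warehouse/tx
.../user/hive/warehouse/tx/delta_0000001_0000001
.../user/hive/warehouse/tx/delta_0000002_0000002
.../user/hive/warehouse/tx/delta_0000003_0000003

With the compactor enabled (hive.compactor.initiator.on = true above), these deltas are periodically merged into larger base files, which keeps the NameNode pressure discussed earlier under control.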

3.1.7 Hive Optimization

Many factors affect Hive's performance. When tuning Hive, read the HiveQL as if it were a MapReduce program, that is, think about performance from the standpoint of how the M/R jobs actually run and optimize at that lower level, rather than confining yourself to rewriting the query logic.

RAC (Real Application Cluster) is like a nimble delivery van: it responds quickly. Hadoop is like a freighter with enormous throughput: its start-up cost is high, so if each run handles only a small amount of input and output, utilization will be very low. The first task in using Hadoop well is therefore to increase the amount of data each job processes.

Hadoop's core capabilities are partition and sort, and these are therefore also the foundation of optimization.

Observing how Hadoop processes data reveals several notable characteristics:

(1) Sheer data volume is not the main source of load; the real pressure at run time comes from skew in the data being processed.

(2) Jobs made up of many sub-jobs run relatively inefficiently. Even a table of only a few hundred rows, if joined and aggregated repeatedly, can produce dozens of jobs and take more than 30 minutes, with most of that time going to job scheduling, initialization, and data output; M/R job initialization is one of the more expensive parts of the process.

(3) UDAF functions such as SUM, COUNT, MAX, and MIN are not troubled by data skew, because Hadoop performs map-side aggregation for them, so skew is not a problem.

(4) COUNT(DISTINCT) is inefficient on large data sets, and multiple COUNT(DISTINCT) expressions are even worse, because COUNT(DISTINCT) groups by the GROUP BY columns and sorts by the DISTINCT column, and that distribution is usually highly skewed. For example, computing male UV and female UV over Taobao's 3 billion page views per day, grouped by gender, yields only 2 reducers, each handling 1.5 billion rows (a rewrite that avoids this is sketched after this list).

(5) Data skew is the main cause of drastic slowdowns; adding one more Map/Reduce pass can be used to avoid it.
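As a concrete illustration of points (4) and (5), a single COUNT(DISTINCT) can usually be rewritten as an inner GROUP BY that deduplicates across many reducers, followed by an ordinary COUNT; the table and column names here are made up for the example:

-- skew-prone form: one reducer per gender must deduplicate all user ids
select gender, count(distinct user_id) from page_views group by gender;

-- rewritten form: deduplication happens in the inner GROUP BY, spread over many reducers
select gender, count(1)
from (select gender, user_id from page_views group by gender, user_id) t
group by gender;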

The conclusion is to sidestep the hot spot: spread out the burden caused by data skew by using more jobs, more input passes, more storage space, and otherwise idle CPU.

Next come some experiments with optimization techniques for Hive tables.

First, it helps to know the file formats Hive supports:

(1) sequencefile (row-oriented file of binary key-value pairs; splittable).

(2) rcfile (Record Columnar File; column-oriented).

(3) orc (Optimized Row Columnar file; the optimized successor to RCFile).

(4) parquet (based on the Google Dremel format).

Knowing the file formats, the default format can be set through the configuration file:

hive.default.fileformat(hive-site.xml)
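For example, a hive-site.xml entry that makes ORC the default format might look like this (TextFile is the built-in default; ORC here is only an illustration):

<property>
  <name>hive.default.fileformat</name>
  <value>ORC</value>
  <description>Default file format for CREATE TABLE statements.</description>
</property>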

Then create a sequencefile table:

create table seqtest1(id int,name string,age int) stored as sequencefile;

A sequencefile table cannot load a plain text file directly; the text data must first be loaded into the text-format table test4 and then inserted into seqtest1 from there (a sketch of building test4 follows).
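test4 itself is an ordinary text-format table; a sketch of how such a staging table could be created and loaded (the local path is hypothetical) is:

create table test4(id int, name string, age int)
row format delimited fields terminated by '\t';
load data local inpath '/home/hadoop/test4.txt' into table test4;  -- hypothetical path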

insert into seqtest1 select * from test4;

hive> insert into seqtest1 select * from test4;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.

Query ID = lvqianqian_20181121121532_1b8dee3a-eef0-4172-89c4-b15171a19cc5

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1479756136352_0001, Tracking URL = http://hadoop0:8888/proxy/application_1479756136352_0001/

Kill Command = /home/hadoop/software/hadoop-2.7.3/bin/hadoop job -kill job_1479756136352_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2018-11-21 12:31:03,232 Stage-1 map = 0%, reduce = 0%

2018-11-21 12:31:12,598 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.17 sec

MapReduce Total cumulative CPU time: 1 seconds 170 msec

Ended Job = job_1479756136352_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop0:8020/user/hive/warehouse/myhive.db/seqtest1/.hive-staging_hive_2018-11-21_12-23-21_534_6082095967638849700-1/-ext-10000

Loading data to table myhive.seqtest1

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 1.32 sec HDFS Read: 4039 HDFS Write: 378110 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 320 msec

OK

Time taken: 481.128 seconds

Besides controlling Hive's input files, another useful optimization is compression. Compressing intermediate or final data in Hive improves throughput and performance by reducing the amount of data that must be loaded into memory. Compression also saves a great deal of disk space; a text-based data file, for instance, can often be shrunk by 40% or more, and compressed files cost far less disk-to-disk transfer and I/O. Compression and decompression do add CPU overhead, but they save much more in I/O and memory. Three common compression codecs are lz4, LZO, and snappy, whose efficiency compares as follows:

lz4 < LZO < snappy

Whether compression is applied can also be controlled in the same configuration file:

hive.exec.compress.output=false //whether to compress final output files; default false

hive.exec.compress.intermediate=false //whether to compress intermediate files; default false

hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec //codec for intermediate compression; empty by default

hive.intermediate.compression.type //compression type for intermediate data

Compression command:

set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
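A typical session-level usage, assuming the Snappy codec is available on the cluster, is to switch on both intermediate and output compression before running the query:

set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true;
insert into seqtest1 select * from test4;  -- same insert as above, now with compression enabled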

The commands above improve performance by compressing data. In addition, storage itself can be optimized when Hive writes data: the idea is to set a higher replication factor for hot data so that data locality can be exploited more often.

hdfs dfs -setrep -R -w 4 /user/hive/warehouse/employee;
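Here 4 is the new replication factor (the HDFS default is 3), -w makes the command wait until the requested replication is actually reached, and -R is accepted for compatibility (setrep already applies to all files under a directory).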

All of the optimizations above take effect while tables are being created and data is being inserted. After the data has been stored in Hive, it can still be optimized further, namely through job and query optimization:

(1) Local mode (small jobs are automatically run locally), which involves three settings. Since version 0.7.0, Hive can run in local (Hadoop standalone) mode. In local mode, Hive can be tuned by changing a few settings: enable automatic switching to local mode, set the maximum input size in bytes for local mode, and set the maximum number of input files for local mode. The settings are as follows:

hive.exec.mode.local.auto=true

hive.exec.mode.local.auto.inputbytes.max=134217728

hive.exec.mode.local.auto.input.files.max=4

(2) set hive.exec.mode.local.auto=true //switch execution to local mode automatically.

(3) set hive.map.aggr=true; //perform partial aggregation on the map side; faster, but uses more memory.

With all these optimizations in hand, insert a row into a table as a test: insert into table test4 values(1,'kk',13); (note the "local Hadoop" messages in the output below).

Step 1: insert the data.

hive> insert into table test4 values(1,'kk',13);

Automatically selecting local only mode for query

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.

Query ID = lvqianqian_20181121121532_1b8dee3a-eef0-4172-89c4-b15171a19cc5

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Job running in-process (local Hadoop)

2018-11-21 13:29:20,133 Stage-1 map = 0%, reduce = 0%

2018-11-21 13:29:52,869 Stage-1 map = 100%, reduce = 0%

Ended Job = job_local1269969100_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop0:8020/user/hive/warehouse/myhive.db/test4/.hive-staging_hive_2018-11-21_13-24-49_487_3825075573599201360-1/-ext-10000

Loading data to table myhive.test4

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 79 HDFS Write: 181371729 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Time taken: 304.094 seconds

Step 2: set the maximum input size in bytes and the maximum number of input files.

set hive.exec.mode.local.auto.inputbytes.max=50000000;

set hive.exec.mode.local.auto.input.files.max=5;

Step 3: enable local mode.

set hive.exec.mode.local.auto=true; //default is false

Besides this approach, LIMIT queries can also be optimized. LIMIT optimization is controlled by the following settings:

(1) Whether to enable LIMIT optimization, which samples the source data when a LIMIT clause is used.

(2) The minimum amount of row data guaranteed when LIMIT queries a subset of the data.

(3) The maximum number of files that may be sampled when LIMIT queries a subset of the data.

(4) The maximum number of rows allowed when simple LIMIT sampling is used.

The corresponding Hive configuration is shown below.

<property>
  <name>hive.limit.optimize.enable</name>
  <value>false</value>
  <description>Whether to enable to optimization to trying a smaller subset of data for simple LIMIT first.</description>
</property>

<property>
  <name>hive.limit.row.max.size</name>
  <value>100000</value>
  <description>When trying a smaller subset of data for simple LIMIT, how much size we need to guarantee each row to have at least.</description>
</property>

<property>
  <name>hive.limit.optimize.limit.file</name>
  <value>10</value>
  <description>When trying a smaller subset of data for simple LIMIT, maximum number of files we can sample.</description>
</property>

Beyond the settings above, a few more can be adjusted:

(1) Whether indexes are used automatically.

(2) How many partitions may be scanned for each partitioned table; the default value "-1" means no limit.

The relevant Hive settings are shown below.

<property>
  <name>hive.limit.optimize.fetch.max</name>
  <value>50000</value>
  <description>
    Maximum number of rows allowed for a smaller subset of data for simple LIMIT, if it is a fetch query.
    Insert queries are not restricted by this limit.
  </description>
</property>

<property>
  <name>hive.limit.pushdown.memory.usage</name>
  <value>0.1</value>
  <description>
    Expects value between 0.0f and 1.0f.
    The fraction of available memory to be used for buffering rows in Reducesink operator for limit pushdown optimization.
  </description>
</property>

<property>
  <name>hive.limit.query.max.table.partition</name>
  <value>-1</value>
  <description>
    This controls how many partitions can be scanned for each partitioned table.
    The default value "-1" means no limit.
  </description>
</property>

Any of the methods above can be used to optimize Hive.

JVM reuse is a Hadoop tuning parameter that has a very large impact on Hive performance, especially in scenarios where small files are hard to avoid or where there are very many tasks, most of which finish quickly; it can also speed up Hive data operations as a whole. Reuse only applies to tasks of the same job; tasks of different jobs run in separate JVMs. The setting is set mapred.job.reuse.jvm.num.tasks=5 (in a local environment, roughly speaking, a value of 5 or less means a single JVM, a larger value means two, and -1 means all tasks of the job run in one JVM). explain shows the execution plan, and tasks without fixed dependencies on each other can be executed concurrently; when JVMs are reused, such tasks no longer have to run strictly in order, so they can run in parallel. The following settings therefore allow a job's independent stages to be processed in parallel:

set hive.exec.parallel=true; //enable concurrent execution of MR stages; default false

set hive.exec.parallel.thread.number=16; //maximum number of stages that may run in parallel; default 8
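Parallel execution only helps when a query contains independent stages; a typical case is a UNION ALL whose branches do not depend on each other. The table names below are just placeholders:

set hive.exec.parallel=true;
-- the two aggregations have no dependency on each other,
-- so their MR stages can be launched at the same time
select 'a' as src, count(*) from table_a
union all
select 'b' as src, count(*) from table_b;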

By default Hadoop forks a new JVM for each Map and Reduce task, and this JVM start-up can be a considerable overhead, especially for jobs containing many thousands of tasks. The JVM reuse settings are as follows.

[not recommended]

SET mapred.job.reuse.jvm.num.tasks=5; //used with MapReduce 1; not applicable on YARN


[yarn]

//mapred-site.xml

mapreduce.job.ubertask.enable=false //run a whole series of small tasks in a single JVM (uber task); default false

mapreduce.job.ubertask.maxmaps=9 //maximum number of maps for an uber job; 9, can only be lowered

mapreduce.job.ubertask.maxreduces=1 //currently only 1 reduce is supported

mapreduce.job.ubertask.maxbytes=128m

Another problem that often appears in Hive, and a fairly serious one because it can badly hurt job efficiency, is data skew. Informally, data skew means the data being processed is unevenly distributed, so a large portion of it piles up on one point and creates a hot spot. Skew is what MapReduce analysis fears most: operations such as count, distinct, group by, and join all trigger a shuffle, and once the shuffle happens all rows with the same key are pulled to one node (or a few nodes), which easily produces a single-point bottleneck. The usual symptoms are the following:

The map phase is fast while the reduce phase is slow. The reduce phase normally should not be slow; if it is, data skew is very likely the cause.

(1) Some reducers finish quickly while others are very slow.

(2) Hive executes in stages, and the amount of data each map processes depends on the reduce output of the previous stage; if on the map side some tasks run much faster than others, that too indicates data skew.

Data skew can be mitigated by adjusting a few settings, as shown below:

SET hive.optimize.skewjoin=true; //enable skew join optimization

SET hive.skewjoin.key=100000; //when the number of rows for a key exceeds this value, the additional rows are sent to reducers that are not yet in use

SET hive.groupby.skewindata=true; //apply data-skew optimization to GROUP BY; default false
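When these settings are not enough, skew in a GROUP BY can also be handled manually by salting the hot key so that it spreads over several reducers and then aggregating a second time; this is essentially what hive.groupby.skewindata does automatically. The table name, column name, and the choice of 10 salt values below are made up for the sketch:

-- first pass: attach a random salt so one hot key is split across up to 10 reduce groups
-- second pass: drop the salt and merge the partial counts
select key, sum(cnt) as cnt
from (
  select key, salt, count(1) as cnt
  from (select key, cast(floor(rand() * 10) as int) as salt from skewed_table) s
  group by key, salt
) t
group by key;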

3.1.8 Hive Explain Plans and Analysis

As we know, Hive translates each SQL statement into MapReduce code at execution time, but how can we see the details of that MR execution? This is where the explain keyword comes in: it shows in detail the MR plan behind the statement being executed. The syntax is as follows; the extended keyword lists the execution process in even more detail.

$beeline>explain [extended] select sum(id) from customers ;

0: jdbc:hive2://localhost:10000/myhive> explain select sum(id) from users;

[Screenshots in the original show the explain output for select sum(id) from users.]

explain turns the query into a sequence of stages, and its output consists of three main parts:

(1) The abstract syntax tree of the query.

(2) The dependencies between the stages of the plan.

(3) A description of each stage, which lists the operators involved and the data they operate on, such as select operators, filter operators, fetch operators, and so on.

The following examples show the syntax and effect of explain.

The first example shows the plan for a simple query.

hive> explain select * from test1;

OK

STAGE DEPENDENCIES:

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-0

Fetch Operator

limit: -1

Processor Tree:

TableScan

alias: test1

Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: id (type: int)

outputColumnNames: _col0

Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE

ListSink

Time taken: 2.564 seconds, Fetched: 17 row(s)

The second example shows the plan for a query with a row limit.

hive> explain select * from test1 limit 1,2;

OK

STAGE DEPENDENCIES:

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-0

Fetch Operator

limit: 2

Processor Tree:

TableScan

alias: test1

Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE

Select Operator

expressions: id (type: int)

outputColumnNames: _col0

Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE

Limit

Number of rows: 2

Offset of rows: 1

Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE

ListSink

Time taken: 0.18 seconds, Fetched: 21 row(s)

After explain, there is one more keyword worth knowing: analyze. This command analyzes table data and produces statistics that inform execution-plan selection. It gathers statistics at the table, partition, and column level and passes them as input to the CBO (cost-based optimizer), which then chooses the lowest-cost query plan. A simple example illustrates what analyze does.

This example computes table-level statistics, as follows:

hive> analyze table test1 compute statistics;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.

Query ID = lvqianqian_20181122130047_adb1e845-5401-44af-8c26-edb7d8f7270c

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1479847093336_0001, Tracking URL = http://hadoop0:8888/proxy/application_1479847093336_0001/

Kill Command = /home/hadoop/software/hadoop-2.7.3/bin/hadoop job -kill job_1479847093336_0001

Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0

2018-11-22 13:22:05,790 Stage-0 map = 0%, reduce = 0%

2018-11-22 13:22:08,608 Stage-0 map = 100%, reduce = 0%, Cumulative CPU 2.5 sec

MapReduce Total cumulative CPU time: 2 seconds 500 msec

Ended Job = job_1479847093336_0001

MapReduce Jobs Launched:

Stage-Stage-0: Map: 1 Cumulative CPU: 2.5 sec HDFS Read: 2768 HDFS Write: 374948 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 500 msec

OK

Time taken: 500.843 seconds

0: jdbc:hive2://localhost:10000/myhive> analyze table users compute statistics ;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

No rows affected (18.816 seconds)

0: jdbc:hive2://localhost:10000/myhive> desc extended users;

[A screenshot in the original shows the output of desc extended users.]
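In addition to table-level statistics, analyze can also gather column-level statistics for the cost-based optimizer, for example:

analyze table users compute statistics for columns;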

