數據倉庫-拉鍊表

在數據倉庫的數據模型設計過程中,經常會遇到這樣的需求:

1.數據量比較大;

2.表中的部分字段會被update,如用戶的地址,產品的描述信息,訂單的狀態等等;

3.需要查看某一個時間點或者時間段的歷史快照信息,比如,查看某一個訂單在歷史某一個時間點的狀態, 比如,查看某一個用戶在過去某一段時間內,更新過幾次等等;

4.變化的比例和頻率不是很大,比如,總共有1000萬的會員,每天新增和發生變化的有10萬左右;

5.如果對這邊表每天都保留一份全量,那麼每次全量中會保存很多不變的信息,對存儲是極大的浪費;遇到些問題在沒解決嗎?

拉鍊歷史表,既能滿足反應數據的歷史狀態,又可以最大程度的節省存儲。

拉鍊表案例:

1、有一張訂單表,2019-06-20 這天裡面有三條訂單數據:


數據倉庫-拉鍊表

到了2019-06-21這天,表中有5條記錄:

數據倉庫-拉鍊表

到了2019-06-22日,表中有6條記錄:


數據倉庫-拉鍊表

這種存儲方式有什麼問題呢?

1、只保留一份全量,則數據和6月22日的記錄一樣,如果需要查看6月21日訂單001的狀態,則無法滿足;

2、每天都保留一份全量,則數據倉庫中的該表共有14條記錄,但好多記錄都是重複保存,沒有任務變化,如訂單002,004,數據量大了,會造成很大的存儲浪費;

如果設計成歷史拉鍊表,如下:


數據倉庫-拉鍊表

拉鍊表解釋:

說明:

1.dw_begin_date表示該條記錄的生命週期開始時間,dw_end_date表示該條記錄的生命週期結束時間;

2.dw_end_date = '9999-12-31'表示該條記錄目前處於有效狀態;

3.如果查詢當前所有有效的記錄,則select * from order_his where dw_end_date = '9999-12-31'

4.如果查詢2012-06-21的歷史快照,則select * from order_his where dw_begin_date <= '2012-06-21' and dw_end_date >= '2012-06-21',這條語句會查詢到以下記錄:


數據倉庫-拉鍊表

可以看出,這樣的歷史拉鍊表,既能滿足對歷史數據的需求,又能很大程度的節省存儲資源;


拉鍊表實戰:

源系統中訂單表結構為:

<code>CREATE TABLE orders_20190821 (orderid INT,createtime STRING,modifiedtime STRING,status STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;CREATE TABLE orders_20190822 (orderid INT,createtime STRING,modifiedtime STRING,status STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;CREATE TABLE orders_20190823 (orderid INT,createtime STRING,modifiedtime STRING,status STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;/<code>

數據:

<code>2019-08-21訂單數據表(order)1,2019-08-18,2019-08-18,創建2,2019-08-18,2019-08-18,創建3,2019-08-19,2019-08-21,支付4,2019-08-19,2019-08-21,完成5,2019-08-19,2019-08-20,支付6,2019-08-20,2019-08-20,創建7,2019-08-20,2019-08-21,支付8,2019-08-21,2019-08-21,創建2019-08-22訂單數據表(order)*1,2019-08-18,2019-08-22,支付*2,2019-08-18,2019-08-22,完成3,2019-08-19,2019-08-21,支付4,2019-08-19,2019-08-21,完成5,2019-08-19,2019-08-20,支付*6,2019-08-20,2019-08-22,支付7,2019-08-20,2019-08-21,支付*8,2019-08-21,2019-08-22,支付*9,2019-08-22,2019-08-22,創建*10,2019-08-22,2019-08-22,支付2019-08-23訂單數據表1,2019-08-18,2019-08-23,完成2,2019-08-18,2019-08-22,完成3,2019-08-19,2019-08-23,完成4,2019-08-19,2019-08-21,完成5,2019-08-19,2019-08-23,完成6,2019-08-20,2019-08-22,支付7,2019-08-20,2019-08-21,支付8,2019-08-21,2019-08-23,完成9,2019-08-22,2019-08-22,創建10,2019-08-22,2019-08-22,支付11,2019-08-23,2019-08-23,創建12,2019-08-23,2019-08-23,創建13,2019-08-23,2019-08-23,支付/<code>
<code>load data  inpath '/user/root/csii/orders_20190821' into table orders_20190821;load data  inpath '/user/root/csii/orders_20190822' into table orders_20190822;load data  inpath '/user/root/csii/orders_20190823' into table orders_20190823;/<code>

ODS層設計:

在數據倉庫的ODS層,有一張訂單的增量數據表,按天分區,存放每天的增量數據:

<code>CREATE TABLE ods_orders_inc (orderid INT,createtime STRING,modifiedtime STRING,status STRING)PARTITIONED BY (day STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;/<code>

DW層設計:

在數據倉庫的DW層,有一張訂單的歷史數據拉鍊表,存放訂單的歷史狀態數據:

<code>CREATE TABLE dw_orders_his (orderid INT,createtime STRING,modifiedtime STRING,status STRING,dw_start_date STRING,dw_end_date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','stored AS textfile;/<code>

第一步,抽取全量數據到ODS:

<code>INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-20')SELECT orderid,createtime,modifiedtime,statusFROM orders_20190821WHERE createtime <= '2019-08-20';/<code>

第二步,從ODS刷新到DW:

<code>INSERT overwrite TABLE dw_orders_hisSELECT orderid,createtime,modifiedtime,status,createtime AS dw_start_date,'9999-12-31' AS dw_end_dateFROM ods_orders_incWHERE day = '2019-08-20';/<code>

完成後,DW訂單歷史表中數據:

<code>hive> select * from dw_orders_his;1       2019-08-18      2019-08-18      創建    2019-08-18      9999-12-31      2       2019-08-18      2019-08-18      創建    2019-08-18      9999-12-313       2019-08-19      2019-08-21      支付    2019-08-19      9999-12-314       2019-08-19      2019-08-21      完成    2019-08-19      9999-12-315       2019-08-19      2019-08-20      支付    2019-08-19      9999-12-316       2019-08-20      2019-08-20      創建    2019-08-20      9999-12-317       2019-08-20      2019-08-21      支付    2019-08-20      9999-12-31/<code>

增量抽取

每天,從源系統訂單表中,將前一天的增量數據抽取到ODS層的增量數據表。 這裡的增量需要通過訂單表中的創建時間和修改時間來確定:

<code>INSERT overwrite TABLE ods_orders_inc PARTITION (day = '${day}')SELECT orderid,createtime,modifiedtime,statusFROM ordersWHERE createtime = '${day}' OR modifiedtime = '${day}';/<code>

增量刷新歷史數據

從2019-08-22開始,需要每天正常刷新前一天(2019-08-21)的增量數據到歷史表。

第一步,通過增量抽取,將2019-08-21的數據抽取到ODS:

<code>INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-21')SELECT orderid,createtime,modifiedtime,statusFROM orders_20190821WHERE modifiedtime = '2019-08-21' or createtime = '2019-08-21';/<code>

ODS增量表中2019-08-21的數據如下:

<code>hive> select * from ods_orders_inc where day = '2019-08-21';3       2019-08-19      2019-08-21      支付    2019-08-214       2019-08-19      2019-08-21      完成    2019-08-217       2019-08-20      2019-08-21      支付    2019-08-218       2019-08-21      2019-08-21      創建    2019-08-21/<code>

第二步,通過DW歷史數據(數據日期為2019-08-20),和ODS增量數據(2019-08-21),刷新歷史表:**

先把數據放到一張臨時表中:

<code>DROP TABLE IF EXISTS dw_orders_his_tmp;CREATE TABLE dw_orders_his_tmp AS SELECT orderid,createtime,modifiedtime,status,dw_start_date,dw_end_date FROM (    SELECT a.orderid,    a.createtime,    a.modifiedtime,    a.status,    a.dw_start_date,    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-21' THEN '2019-08-20' ELSE a.dw_end_date END AS dw_end_date     FROM dw_orders_his a     left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-21') b     ON (a.orderid = b.orderid)     UNION ALL     SELECT orderid,    createtime,    modifiedtime,    status,    modifiedtime AS dw_start_date,    '9999-12-31' AS dw_end_date     FROM ods_orders_inc     WHERE day = '2019-08-21' ) x ORDER BY orderid,dw_start_date;/<code>

最後把臨時表中數據插入歷史表:

<code>INSERT overwrite TABLE dw_orders_hisSELECT * FROM dw_orders_his_tmp;/<code>

查看拉鍊表數據:


數據倉庫-拉鍊表

將2019-08-22的增量數據刷新到歷史表:

<code>第一步,通過增量抽取,將2019-08-22的數據抽取到ODS:INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-22') SELECT orderid,createtime,modifiedtime,status FROM orders_20190822WHERE createtime = '2019-08-22' OR modifiedtime = '2019-08-22';第二步,通過DW歷史數據(數據日期為2019-08-21),和ODS增量數據(2019-08-22)DROP TABLE IF EXISTS dw_orders_his_tmp;CREATE TABLE dw_orders_his_tmp AS SELECT orderid,createtime,modifiedtime,status,dw_start_date,dw_end_date FROM (    SELECT a.orderid,    a.createtime,    a.modifiedtime,    a.status,    a.dw_start_date,    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-22' THEN '2019-08-21' ELSE a.dw_end_date END AS dw_end_date     FROM dw_orders_his a     left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-22') b     ON (a.orderid = b.orderid)     UNION ALL     SELECT orderid,    createtime,    modifiedtime,    status,    modifiedtime AS dw_start_date,    '9999-12-31' AS dw_end_date     FROM ods_orders_inc     WHERE day = '2019-08-22' ) x ORDER BY orderid,dw_start_date; 第三步:最後把臨時表中數據插入歷史表:INSERT overwrite TABLE dw_orders_his SELECT * FROM dw_orders_his_tmp;/<code>

刷新完後歷史表數據如下:


數據倉庫-拉鍊表


查看2019-08-21的歷史快照數據:

<code>select * from dw_orders_his where dw_start_date <= '2019-08-21' and dw_end_date >= '2019-08-21';/<code>


將2019-08-23的增量數據刷新到歷史表:

<code>第一步,通過增量抽取,將2019-08-22的數據抽取到ODS:INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-23') SELECT orderid,createtime,modifiedtime,status FROM orders_20190823WHERE createtime = '2019-08-23' OR modifiedtime = '2019-08-23';第二步,通過DW歷史數據(數據日期為2019-08-22),和ODS增量數據(2019-08-23)DROP TABLE IF EXISTS dw_orders_his_tmp;CREATE TABLE dw_orders_his_tmp AS SELECT orderid,createtime,modifiedtime,status,dw_start_date,dw_end_date FROM (    SELECT a.orderid,    a.createtime,    a.modifiedtime,    a.status,    a.dw_start_date,    CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-23' THEN '2019-08-22' ELSE a.dw_end_date END AS dw_end_date     FROM dw_orders_his a     left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-23') b     ON (a.orderid = b.orderid)     UNION ALL     SELECT orderid,    createtime,    modifiedtime,    status,    modifiedtime AS dw_start_date,    '9999-12-31' AS dw_end_date     FROM ods_orders_inc     WHERE day = '2019-08-23' ) x ORDER BY orderid,dw_start_date;第三步:最後把臨時表中數據插入歷史表:INSERT overwrite TABLE dw_orders_his SELECT * FROM dw_orders_his_tmp;/<code>


<code>hive> select * from dw_orders_his order by orderid,dw_start_date;1       2019-08-18      2019-08-18      創建    2019-08-18      2019-08-211       2019-08-18      2019-08-22      支付    2019-08-22      2019-08-221       2019-08-18      2019-08-23      完成    2019-08-23      9999-12-312       2019-08-18      2019-08-18      創建    2019-08-18      2019-08-212       2019-08-18      2019-08-22      完成    2019-08-22      9999-12-313       2019-08-19      2019-08-21      支付    2019-08-19      2019-08-203       2019-08-19      2019-08-21      支付    2019-08-21      2019-08-223       2019-08-19      2019-08-23      完成    2019-08-23      9999-12-314       2019-08-19      2019-08-21      完成    2019-08-19      2019-08-204       2019-08-19      2019-08-21      完成    2019-08-21      9999-12-315       2019-08-19      2019-08-20      支付    2019-08-19      2019-08-225       2019-08-19      2019-08-23      完成    2019-08-23      9999-12-316       2019-08-20      2019-08-20      創建    2019-08-20      2019-08-216       2019-08-20      2019-08-22      支付    2019-08-22      9999-12-317       2019-08-20      2019-08-21      支付    2019-08-20      2019-08-207       2019-08-20      2019-08-21      支付    2019-08-21      9999-12-318       2019-08-21      2019-08-21      創建    2019-08-21      2019-08-218       2019-08-21      2019-08-22      支付    2019-08-22      2019-08-228       2019-08-21      2019-08-23      完成    2019-08-23      9999-12-319       2019-08-22      2019-08-22      創建    2019-08-22      9999-12-3110      2019-08-22      2019-08-22      支付    2019-08-22      9999-12-3111      2019-08-23      2019-08-23      創建    2019-08-23      9999-12-3112      2019-08-23      2019-08-23      創建    2019-08-23      9999-12-3113      2019-08-23      2019-08-23      支付    2019-08-23      9999-12-31/<code>

查看最新的狀態:

<code>hive> select * from dw_orders_his where dw_end_date = '9999-12-31';   /<code>


數據倉庫-拉鍊表


分享到:


相關文章: