數據倉庫-拉鍊表

在數據倉庫的數據模型設計過程中,經常會遇到這樣的需求:

1.數據量比較大;

2.表中的部分字段會被update,如用戶的地址,產品的描述信息,訂單的狀態等等;

3.需要查看某一個時間點或者時間段的歷史快照信息,比如,查看某一個訂單在歷史某一個時間點的狀態, 比如,查看某一個用戶在過去某一段時間內,更新過幾次等等;

4.變化的比例和頻率不是很大,比如,總共有1000萬的會員,每天新增和發生變化的有10萬左右;

5.如果對這邊表每天都保留一份全量,那麼每次全量中會保存很多不變的信息,對存儲是極大的浪費;遇到些問題在沒解決嗎?

拉鍊歷史表,既能滿足反應數據的歷史狀態,又可以最大程度的節省存儲。

拉鍊表案例:

1、有一張訂單表,2019-06-20 這天裡面有三條訂單數據:


數據倉庫-拉鍊表

到了2019-06-21這天,表中有5條記錄:

數據倉庫-拉鍊表

到了2019-06-22日,表中有6條記錄:


數據倉庫-拉鍊表

這種存儲方式有什麼問題呢?

1、只保留一份全量,則數據和6月22日的記錄一樣,如果需要查看6月21日訂單001的狀態,則無法滿足;

2、每天都保留一份全量,則數據倉庫中的該表共有14條記錄,但好多記錄都是重複保存,沒有任務變化,如訂單002,004,數據量大了,會造成很大的存儲浪費;

如果設計成歷史拉鍊表,如下:


數據倉庫-拉鍊表

拉鍊表解釋:

說明:

1.dw_begin_date表示該條記錄的生命週期開始時間,dw_end_date表示該條記錄的生命週期結束時間;

2.dw_end_date = '9999-12-31'表示該條記錄目前處於有效狀態;

3.如果查詢當前所有有效的記錄,則select * from order_his where dw_end_date = '9999-12-31'

4.如果查詢2012-06-21的歷史快照,則select * from order_his where dw_begin_date <= '2012-06-21' and dw_end_date >= '2012-06-21',這條語句會查詢到以下記錄:


數據倉庫-拉鍊表

可以看出,這樣的歷史拉鍊表,既能滿足對歷史數據的需求,又能很大程度的節省存儲資源;


拉鍊表實戰:

源系統中訂單表結構為:

<code>CREATE TABLE orders_20190821 (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;

CREATE TABLE orders_20190822 (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;

CREATE TABLE orders_20190823 (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;/<code>

數據:

<code>2019-08-21訂單數據表(order)
1,2019-08-18,2019-08-18,創建
2,2019-08-18,2019-08-18,創建
3,2019-08-19,2019-08-21,支付
4,2019-08-19,2019-08-21,完成
5,2019-08-19,2019-08-20,支付
6,2019-08-20,2019-08-20,創建
7,2019-08-20,2019-08-21,支付
8,2019-08-21,2019-08-21,創建

2019-08-22訂單數據表(order)
*1,2019-08-18,2019-08-22,支付

*2,2019-08-18,2019-08-22,完成
3,2019-08-19,2019-08-21,支付
4,2019-08-19,2019-08-21,完成
5,2019-08-19,2019-08-20,支付
*6,2019-08-20,2019-08-22,支付
7,2019-08-20,2019-08-21,支付
*8,2019-08-21,2019-08-22,支付
*9,2019-08-22,2019-08-22,創建
*10,2019-08-22,2019-08-22,支付


2019-08-23訂單數據表
1,2019-08-18,2019-08-23,完成
2,2019-08-18,2019-08-22,完成
3,2019-08-19,2019-08-23,完成
4,2019-08-19,2019-08-21,完成
5,2019-08-19,2019-08-23,完成
6,2019-08-20,2019-08-22,支付
7,2019-08-20,2019-08-21,支付
8,2019-08-21,2019-08-23,完成
9,2019-08-22,2019-08-22,創建
10,2019-08-22,2019-08-22,支付
11,2019-08-23,2019-08-23,創建
12,2019-08-23,2019-08-23,創建
13,2019-08-23,2019-08-23,支付/<code>
<code>load data  inpath '/user/root/csii/orders_20190821' into table orders_20190821;
load data inpath '/user/root/csii/orders_20190822' into table orders_20190822;
load data inpath '/user/root/csii/orders_20190823' into table orders_20190823;/<code>

ODS層設計:

在數據倉庫的ODS層,有一張訂單的增量數據表,按天分區,存放每天的增量數據:

<code>CREATE TABLE ods_orders_inc (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
)PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;/<code>

DW層設計:

在數據倉庫的DW層,有一張訂單的歷史數據拉鍊表,存放訂單的歷史狀態數據:

<code>CREATE TABLE dw_orders_his (
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING,
dw_start_date STRING,
dw_end_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored AS textfile;/<code>

第一步,抽取全量數據到ODS:

<code>INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-20')
SELECT orderid,createtime,modifiedtime,status
FROM orders_20190821
WHERE createtime <= '2019-08-20';/<code>

第二步,從ODS刷新到DW:

<code>INSERT overwrite TABLE dw_orders_his
SELECT orderid,createtime,modifiedtime,status,
createtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2019-08-20';/<code>

完成後,DW訂單歷史表中數據:

<code>hive> select * from dw_orders_his;
1 2019-08-18 2019-08-18 創建 2019-08-18 9999-12-31
2 2019-08-18 2019-08-18 創建 2019-08-18 9999-12-31
3 2019-08-19 2019-08-21 支付 2019-08-19 9999-12-31
4 2019-08-19 2019-08-21 完成 2019-08-19 9999-12-31
5 2019-08-19 2019-08-20 支付 2019-08-19 9999-12-31
6 2019-08-20 2019-08-20 創建 2019-08-20 9999-12-31
7 2019-08-20 2019-08-21 支付 2019-08-20 9999-12-31/<code>

增量抽取

每天,從源系統訂單表中,將前一天的增量數據抽取到ODS層的增量數據表。 這裡的增量需要通過訂單表中的創建時間和修改時間來確定:

<code>INSERT overwrite TABLE ods_orders_inc PARTITION (day = '${day}')
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE createtime = '${day}' OR modifiedtime = '${day}';/<code>

增量刷新歷史數據

從2019-08-22開始,需要每天正常刷新前一天(2019-08-21)的增量數據到歷史表。

第一步,通過增量抽取,將2019-08-21的數據抽取到ODS:

<code>INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-21')
SELECT orderid,createtime,modifiedtime,status
FROM orders_20190821
WHERE modifiedtime = '2019-08-21' or createtime = '2019-08-21';/<code>

ODS增量表中2019-08-21的數據如下:

<code>hive> select * from ods_orders_inc where day = '2019-08-21';
3 2019-08-19 2019-08-21 支付 2019-08-21
4 2019-08-19 2019-08-21 完成 2019-08-21
7 2019-08-20 2019-08-21 支付 2019-08-21
8 2019-08-21 2019-08-21 創建 2019-08-21/<code>

第二步,通過DW歷史數據(數據日期為2019-08-20),和ODS增量數據(2019-08-21)

,刷新歷史表:**

先把數據放到一張臨時表中:

<code>DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
SELECT a.orderid,
a.createtime,
a.modifiedtime,
a.status,
a.dw_start_date,
CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-21' THEN '2019-08-20' ELSE a.dw_end_date END AS dw_end_date
FROM dw_orders_his a
left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-21') b
ON (a.orderid = b.orderid)
UNION ALL
SELECT orderid,
createtime,
modifiedtime,
status,
modifiedtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2019-08-21'
) x
ORDER BY orderid,dw_start_date;/<code>

最後把臨時表中數據插入歷史表:

<code>INSERT overwrite TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;/<code>

查看拉鍊表數據:


數據倉庫-拉鍊表

將2019-08-22的增量數據刷新到歷史表:

<code>第一步,通過增量抽取,將2019-08-22的數據抽取到ODS:

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-22')
SELECT orderid,createtime,modifiedtime,status
FROM orders_20190822
WHERE createtime = '2019-08-22' OR modifiedtime = '2019-08-22';

第二步,通過DW歷史數據(數據日期為2019-08-21),和ODS增量數據(2019-08-22)
DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
SELECT a.orderid,
a.createtime,
a.modifiedtime,
a.status,
a.dw_start_date,
CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-22' THEN '2019-08-21' ELSE a.dw_end_date END AS dw_end_date
FROM dw_orders_his a
left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-22') b
ON (a.orderid = b.orderid)
UNION ALL
SELECT orderid,
createtime,
modifiedtime,
status,
modifiedtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2019-08-22'
) x
ORDER BY orderid,dw_start_date;

第三步:最後把臨時表中數據插入歷史表:

INSERT overwrite TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;/<code>

刷新完後歷史表數據如下:


數據倉庫-拉鍊表


查看2019-08-21的歷史快照數據:

<code>select * from dw_orders_his where dw_start_date <= '2019-08-21' and dw_end_date >= '2019-08-21';/<code>


將2019-08-23的增量數據刷新到歷史表:

<code>第一步,通過增量抽取,將2019-08-22的數據抽取到ODS:

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '2019-08-23')
SELECT orderid,createtime,modifiedtime,status
FROM orders_20190823
WHERE createtime = '2019-08-23' OR modifiedtime = '2019-08-23';

第二步,通過DW歷史數據(數據日期為2019-08-22),和ODS增量數據(2019-08-23)
DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
SELECT a.orderid,
a.createtime,
a.modifiedtime,
a.status,
a.dw_start_date,
CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2019-08-23' THEN '2019-08-22' ELSE a.dw_end_date END AS dw_end_date
FROM dw_orders_his a
left outer join (SELECT * FROM ods_orders_inc WHERE day = '2019-08-23') b
ON (a.orderid = b.orderid)
UNION ALL
SELECT orderid,
createtime,
modifiedtime,
status,
modifiedtime AS dw_start_date,
'9999-12-31' AS dw_end_date
FROM ods_orders_inc
WHERE day = '2019-08-23'
) x
ORDER BY orderid,dw_start_date;

第三步:最後把臨時表中數據插入歷史表:

INSERT overwrite TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;/<code>


<code>hive> select * from dw_orders_his order by orderid,dw_start_date;
1 2019-08-18 2019-08-18 創建 2019-08-18 2019-08-21
1 2019-08-18 2019-08-22 支付 2019-08-22 2019-08-22
1 2019-08-18 2019-08-23 完成 2019-08-23 9999-12-31
2 2019-08-18 2019-08-18 創建 2019-08-18 2019-08-21
2 2019-08-18 2019-08-22 完成 2019-08-22 9999-12-31
3 2019-08-19 2019-08-21 支付 2019-08-19 2019-08-20
3 2019-08-19 2019-08-21 支付 2019-08-21 2019-08-22
3 2019-08-19 2019-08-23 完成 2019-08-23 9999-12-31
4 2019-08-19 2019-08-21 完成 2019-08-19 2019-08-20
4 2019-08-19 2019-08-21 完成 2019-08-21 9999-12-31
5 2019-08-19 2019-08-20 支付 2019-08-19 2019-08-22
5 2019-08-19 2019-08-23 完成 2019-08-23 9999-12-31
6 2019-08-20 2019-08-20 創建 2019-08-20 2019-08-21
6 2019-08-20 2019-08-22 支付 2019-08-22 9999-12-31
7 2019-08-20 2019-08-21 支付 2019-08-20 2019-08-20
7 2019-08-20 2019-08-21 支付 2019-08-21 9999-12-31
8 2019-08-21 2019-08-21 創建 2019-08-21 2019-08-21
8 2019-08-21 2019-08-22 支付 2019-08-22 2019-08-22
8 2019-08-21 2019-08-23 完成 2019-08-23 9999-12-31
9 2019-08-22 2019-08-22 創建 2019-08-22 9999-12-31
10 2019-08-22 2019-08-22 支付 2019-08-22 9999-12-31
11 2019-08-23 2019-08-23 創建 2019-08-23 9999-12-31
12 2019-08-23 2019-08-23 創建 2019-08-23 9999-12-31
13 2019-08-23 2019-08-23 支付 2019-08-23 9999-12-31/<code>

查看最新的狀態:

<code>hive> select * from dw_orders_his where dw_end_date = '9999-12-31';   
/<code>


數據倉庫-拉鍊表


分享到:


相關文章: