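Forecasting sales using store, promotion, and competitor data
Rossmann operates more than 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their own circumstances, the accuracy of the results can vary considerably. In its first Kaggle competition, Rossmann asks participants to predict six weeks of daily sales for 1,115 stores located in Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation.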
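Training and test set files
train.csv - historical data including Sales
test.csv - historical data excluding Sales
sample_submission.csv - a sample submission file in the correct format
store.csv - supplemental information about the stores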
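Data field descriptions
Store - a unique ID for each store
Sales - the turnover for the day (the value to predict)
Customers - the number of customers on the day
Open - whether the store was open: 0 = closed, 1 = open
StateHoliday - state holiday indicator. Normally all stores are closed on state holidays. a = public holiday, b = Easter holiday, c = Christmas, 0 = none
SchoolHoliday - school holiday indicator
StoreType - store model: a, b, c, d
Assortment - assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - the year and month in which the nearest competitor opened
Promo - whether the store is running a promotion on that day
Promo2 - whether the store takes part in a continuing, consecutive promotion: 0 = not participating, 1 = participating
Promo2Since[Year/Week] - the year and calendar week in which the store started participating in Promo2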
1. Importing the data
<code>
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from time import time
import pickle

store = pd.read_csv('store.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
</code>
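StateHoliday mixes the digit 0 with the letters a, b, and c, so in a large file pandas may parse the column with mixed types. A minimal sketch, assuming you want the column read consistently as strings; the inline CSV text is made up for illustration and merely stands in for train.csv:

```python
import io

import pandas as pd

# Hypothetical miniature of train.csv, for illustration only
csv_text = "Store,StateHoliday,Sales\n1,0,5263\n2,a,0\n3,0,8314\n"

# Force StateHoliday to str so that 0 is always read as the string '0'
sample = pd.read_csv(io.StringIO(csv_text), dtype={'StateHoliday': str})
holiday_codes = sample['StateHoliday'].tolist()
print(holiday_codes)
```

With the real files, the same `dtype` argument could be passed to `pd.read_csv('train.csv', ...)`.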
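<code>
# Show the test rows where 'Open' is missing
print(test[test['Open'].isnull()])
# Fill the missing values with 1, i.e. treat those stores as open
test.fillna(1, inplace=True)

# Plot the sales history of store 1, using only days with sales
strain = train[train['Sales'] > 0]
strain.loc[strain['Store'] == 1, ['Date', 'Sales']].plot(
    x='Date', y='Sales', title='Store1', figsize=(16, 4))
plt.show()
</code>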
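The plot shows that the store's sales vary periodically. Sales are relatively high in November and December, possibly because of seasonal factors or promotions.
In addition, the sales from June to September 2014 show that the trend in June and July is similar to the trend in August and September. The six weeks we need to predict fall in August and September 2015, so we can hold out the most recent six weeks of data (June and July 2015) for the 1,115 stores and use them as a hold-out set for tuning and validating the model.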
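The hold-out idea described above can also be expressed by date rather than by row position. The following is a minimal sketch under that assumption, not the split used later in this article; `split_last_weeks` and the toy data are hypothetical names introduced only for illustration:

```python
import pandas as pd


def split_last_weeks(df, weeks=6, date_col='Date'):
    """Return (older rows, rows from the final `weeks` weeks)."""
    dates = pd.to_datetime(df[date_col])
    cutoff = dates.max() - pd.Timedelta(weeks=weeks)
    is_holdout = dates > cutoff
    return df[~is_holdout], df[is_holdout]


# Toy data: one store, one row per day from 2015-06-01 to 2015-07-31
toy = pd.DataFrame(
    {'Date': pd.date_range('2015-06-01', '2015-07-31').strftime('%Y-%m-%d')})
toy['Store'] = 1

older, holdout = split_last_weeks(toy)
print(len(older), len(holdout))
```

Splitting by date keeps exactly 42 calendar days per store in the hold-out set, even when some days are missing for some stores.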
<code>
# Keep only the days with sales
train = train[train['Sales'] > 0]
# Attach the store attributes to every row
train = pd.merge(train, store, on='Store', how='left')
test = pd.merge(test, store, on='Store', how='left')
train.info()
</code>
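To make the effect of the left merge concrete, here is a small sketch on made-up data (the values are illustrative, not taken from the competition files). The `validate='many_to_one'` argument is an optional pandas check that each store appears only once in the store table:

```python
import pandas as pd

# Made-up daily sales rows and store attributes
sales = pd.DataFrame({'Store': [1, 1, 2], 'Sales': [5263, 5020, 6064]})
stores = pd.DataFrame({'Store': [1, 2], 'StoreType': ['c', 'a']})

# Every sales row keeps its position and gains the attributes of its store
merged = pd.merge(sales, stores, on='Store', how='left', validate='many_to_one')
print(merged)
```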
2. Feature engineering
<code>
for data in [train, test]:
    # Split the date into numeric year, month, and day features
    dates = pd.to_datetime(data['Date'])
    data['year'] = dates.dt.year
    data['month'] = dates.dt.month
    data['day'] = dates.dt.day

    # Turn 'PromoInterval' into 'IsPromoMonth': 1 if the store is in a
    # promotion month on that day, 0 otherwise.
    # Tip: avoid an explicit loop here. This form is much faster;
    # a loop may keep you waiting painfully long.
    month2str = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
                 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
    data['monthstr'] = data['month'].map(month2str)
    # A non-string PromoInterval (missing, or 0 after filling) means no Promo2 months
    data['IsPromoMonth'] = data.apply(
        lambda x: 1 if isinstance(x['PromoInterval'], str)
        and x['monthstr'] in x['PromoInterval'] else 0,
        axis=1)

    # Convert features that use letter codes for categories into integers
    mappings = {'0': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4}
    for col in ['StoreType', 'Assortment', 'StateHoliday']:
        data[col] = data[col].replace(mappings).astype(int)

# Drop the features that are not needed in the training and test sets
df_train = train.drop(['Date', 'Customers', 'Open', 'PromoInterval', 'monthstr'], axis=1)
df_test = test.drop(['Id', 'Date', 'Open', 'PromoInterval', 'monthstr'], axis=1)

# As described above, hold out the most recent six weeks of the training data
# for testing the model. This relies on train.csv listing the newest dates first.
n_holdout = 6 * 7 * 1115
Xtrain = df_train[n_holdout:]
Xtest = df_train[:n_holdout].copy()

# Inspect the correlations between the features
plt.subplots(figsize=(24, 20))
sns.heatmap(df_train.corr(), cmap='RdYlGn', annot=True, vmin=-0.1, vmax=0.1, center=0)
plt.show()
</code>
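The comment in the code above recommends avoiding slow row-by-row processing. As one possible alternative to `apply(..., axis=1)`, the sketch below builds the same flag with a plain list comprehension over two columns, which avoids constructing a Series per row. It assumes PromoInterval holds strings such as 'Mar,Jun,Sept,Dec' or a non-string placeholder for stores without Promo2; `is_promo_month` is a hypothetical helper name:

```python
import pandas as pd

MONTH_ABBR = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
              7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}


def is_promo_month(months, intervals):
    """Return 1 where the month abbreviation occurs in the interval string."""
    return [int(isinstance(interval, str) and MONTH_ABBR[month] in interval)
            for month, interval in zip(months, intervals)]


# Made-up rows; the last store has no PromoInterval (placeholder 0)
df = pd.DataFrame({
    'month': [1, 2, 9, 7],
    'PromoInterval': ['Jan,Apr,Jul,Oct', 'Jan,Apr,Jul,Oct', 'Mar,Jun,Sept,Dec', 0],
})
df['IsPromoMonth'] = is_promo_month(df['month'], df['PromoInterval'])
print(df)
```

Because the check is a substring test, 'Sep' also matches the spelling 'Sept'.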
<code>
# Train on log1p(Sales) instead of the raw sales values
ytrain = np.log1p(Xtrain['Sales'])
ytest = np.log1p(Xtest['Sales'])
Xtrain = Xtrain.drop(['Sales'], axis=1)
Xtest = Xtest.drop(['Sales'], axis=1)
</code>
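The target is transformed with `np.log1p`, so every prediction has to be mapped back with `np.expm1` before it is compared with real sales. A short sketch of that round trip on made-up values:

```python
import numpy as np

# Made-up sales values, including a zero, which log1p handles safely
sales = np.array([0.0, 99.0, 5263.0])

log_sales = np.log1p(sales)     # log(1 + x), the scale the model is trained on
restored = np.expm1(log_sales)  # exp(x) - 1, back to the original scale
print(log_sales)
print(restored)
```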
3. Building the model
<code>
def rmspe(y, yhat):
    """Root mean square percentage error."""
    return np.sqrt(np.mean((yhat / y - 1) ** 2))


def rmspe_xg(yhat, y):
    """Custom evaluation metric for xgb.train; undoes the log1p transform first."""
    y = np.expm1(y.get_label())
    yhat = np.expm1(yhat)
    return 'rmspe', rmspe(y, yhat)


params = {
    'objective': 'reg:squarederror',  # current name of the former 'reg:linear'
    'booster': 'gbtree',
    'eta': 0.03,
    'max_depth': 10,
    'subsample': 0.9,
    'colsample_bytree': 0.7,
    'verbosity': 0,  # replaces the deprecated 'silent': 1
    'seed': 10,
}
num_boost_round = 6000

dtrain = xgb.DMatrix(Xtrain, ytrain)
dvalid = xgb.DMatrix(Xtest, ytest)
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
</code>
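The `rmspe` function above computes the root mean square percentage error, that is, the square root of the mean of ((yhat - y) / y)^2. A small worked example on made-up numbers, where both predictions are off by exactly 10%:

```python
import numpy as np


def rmspe(y, yhat):
    """Root mean square percentage error."""
    return np.sqrt(np.mean((yhat / y - 1) ** 2))


# Made-up actual and predicted sales: +10% and -10% off
actual = np.array([100.0, 200.0])
predicted = np.array([110.0, 180.0])

score = rmspe(actual, predicted)
print(round(score, 6))
```

Because the error is relative, a miss of 10 units on a 100-unit day counts as much as a miss of 20 units on a 200-unit day.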
4. Training the model
<code>
print('Train a XGBoost model')
start = time()
# Recent XGBoost versions deprecate feval in favor of custom_metric
gbm = xgb.train(params, dtrain, num_boost_round, evals=watchlist,
                early_stopping_rounds=100, feval=rmspe_xg, verbose_eval=True)
# Save the trained model to disk
with open('pima.pickle.dat', 'wb') as f:
    pickle.dump(gbm, f)
end = time()
print('Train time is {:.2f} s.'.format(end - start))
# Output:
# Train time is 923.86 s.
# Training took about 15 minutes.
</code>
5. Validating the results
<code>
with open('pima.pickle.dat', 'rb') as f:
    gbm = pickle.load(f)

print('validating')
Xtest.sort_index(inplace=True)
ytest.sort_index(inplace=True)
yhat = gbm.predict(xgb.DMatrix(Xtest))
error = rmspe(np.expm1(ytest), np.expm1(yhat))
print('RMSPE: {:.6f}'.format(error))
# Output:
# validating
# RMSPE: 0.128683

# Compare predictions with actual sales; both columns are on the log1p scale
res = pd.DataFrame(data=ytest)
res['Predicition'] = yhat
res = pd.merge(Xtest, res, left_index=True, right_index=True)
res['Ratio'] = res['Predicition'] / res['Sales']
res['Error'] = abs(res['Ratio'] - 1)
res['Weight'] = res['Sales'] / res['Predicition']
print(res.head())

col_1 = ['Sales', 'Predicition']
col_2 = ['Ratio']
# Pick three random stores to plot
L = np.random.randint(low=1, high=1115, size=3)
print('Mean ratio of predicted to actual sales is {}: all stores'.format(res['Ratio'].mean()))
for i in L:
    s1 = pd.DataFrame(res[res['Store'] == i], columns=col_1)
    s2 = pd.DataFrame(res[res['Store'] == i], columns=col_2)
    s1.plot(title='Comparison of predicted and actual sales: store {}'.format(i), figsize=(12, 4))
    s2.plot(title='Ratio of predicted to actual sales: store {}'.format(i), figsize=(12, 4))
    print('Mean ratio of predicted to actual sales is {}: store {}'.format(s2['Ratio'].mean(), i))
plt.show()

# Show the ten rows with the largest error
res.sort_values(['Error'], ascending=False, inplace=True)
print(res[:10])
</code>
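Besides looking at the ten worst rows, it can help to see which stores are predicted worst on average. A minimal sketch on made-up numbers, with sales and predictions in original units rather than on the log1p scale used above; `res_toy` and its columns are illustrative names only:

```python
import pandas as pd

# Made-up actual and predicted sales for two stores
res_toy = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'Sales': [100.0, 200.0, 100.0, 100.0],
    'Prediction': [110.0, 180.0, 150.0, 100.0],
})

# Absolute relative error per row, then the mean per store, worst first
res_toy['Error'] = (res_toy['Prediction'] / res_toy['Sales'] - 1).abs()
store_error = res_toy.groupby('Store')['Error'].mean().sort_values(ascending=False)
print(store_error)
```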
6. Optimizing the model
<code>
print('weight correction')
# Try a single correction factor between 0.990 and 1.009 for all predictions
W = [(0.990 + (i / 1000)) for i in range(20)]
S = []
for w in W:
    error = rmspe(np.expm1(ytest), np.expm1(yhat * w))
    print('RMSPE for {:.3f}:{:.6f}'.format(w, error))
    S.append(error)
Score = pd.Series(S, index=W)
Score.plot()
BS = Score[Score.values == Score.values.min()]
print('Best weight for Score:{}'.format(BS))
# Output:
# weight correction
# RMSPE for 0.990:0.131899
# RMSPE for 0.991:0.129076
# RMSPE for 0.992:0.126723
# ……
# Best weight for Score:0.996    0.122779
# dtype: float64
plt.show()
</code>
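The weight correction above is a simple grid search: each candidate factor multiplies the log-scale prediction, and the factor with the lowest RMSPE wins. The sketch below reproduces the idea on made-up data in which the predictions are deliberately too high; `best_weight` is a hypothetical helper name:

```python
import numpy as np


def rmspe(y, yhat):
    """Root mean square percentage error."""
    return np.sqrt(np.mean((yhat / y - 1) ** 2))


def best_weight(y_log, yhat_log, weights):
    """Return the weight with the lowest RMSPE and that RMSPE."""
    errors = [rmspe(np.expm1(y_log), np.expm1(yhat_log * w)) for w in weights]
    best = int(np.argmin(errors))
    return weights[best], errors[best]


# Made-up log-scale targets and predictions that overshoot slightly
y_log = np.log1p(np.array([100.0, 200.0, 300.0]))
yhat_log = y_log / 0.995

weights = [0.990 + k / 1000 for k in range(20)]
w, err = best_weight(y_log, yhat_log, weights)
print(round(w, 3))
```

A factor slightly below 1 on the log scale shrinks every prediction a little, which compensates for a model that tends to overshoot.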
<code>
# Find a separate correction factor for each store
L = range(1115)
W_ho = []    # weights for the hold-out rows
W_test = []  # weights for the test rows
for i in L:
    s1 = pd.DataFrame(res[res['Store'] == i + 1], columns=col_1)
    s2 = pd.DataFrame(df_test[df_test['Store'] == i + 1])
    W1 = [(0.990 + (k / 1000)) for k in range(20)]
    S = []
    for w in W1:
        error = rmspe(np.expm1(s1['Sales']), np.expm1(s1['Predicition'] * w))
        S.append(error)
    Score = pd.Series(S, index=W1)
    # Keep a single best weight, even if several weights tie for the minimum
    best_w = Score.idxmin()
    W_ho.extend([best_w] * len(s1))
    W_test.extend([best_w] * len(s2))

# Align the hold-out weights with the original row order of Xtest
Xtest = Xtest.sort_values(by='Store')
Xtest['W_ho'] = W_ho
Xtest = Xtest.sort_index()
W_ho = list(Xtest['W_ho'].values)
Xtest.drop(['W_ho'], axis=1, inplace=True)

# Do the same for the test set
df_test = df_test.sort_values(by='Store')
df_test['W_test'] = W_test
df_test = df_test.sort_index()
W_test = list(df_test['W_test'].values)
df_test.drop(['W_test'], axis=1, inplace=True)

yhat_new = yhat * np.array(W_ho)
error = rmspe(np.expm1(ytest), np.expm1(yhat_new))
print('RMSPE for weight correction {:.6f}'.format(error))
# Output:
# RMSPE for weight correction 0.116168
# This is a further noticeable improvement over the 0.122779 from the global correction.
</code>
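The code above computes `W_test` but does not show how it is used. One plausible way, sketched here as an assumption rather than the author's method, is to multiply the log-scale test predictions by the per-row weights, map them back with `np.expm1`, and store them with the `Id` and `Sales` columns expected by sample_submission.csv. `build_submission` and the toy inputs are hypothetical:

```python
import numpy as np
import pandas as pd


def build_submission(ids, log_predictions, weights):
    """Apply per-row weights on the log scale and return an Id/Sales table."""
    corrected = np.asarray(log_predictions) * np.asarray(weights)
    return pd.DataFrame({'Id': ids, 'Sales': np.expm1(corrected)})


# Made-up ids, log-scale predictions, and per-row weights
ids = [1, 2, 3]
log_predictions = np.log1p([100.0, 200.0, 300.0])
weights = [1.0, 1.0, 0.995]

submission = build_submission(ids, log_predictions, weights)
print(submission)
```

With real data, `log_predictions` would come from `gbm.predict(xgb.DMatrix(df_test))`, and the result could be written with `submission.to_csv(..., index=False)`.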