kaggle | 商城客户细分数据

@Author:BY Runsen

@Date:2019年06月09日

无聊看下kaggle,发现了一个不错 的数据集

您有超市购物中心和会员卡,您可以获得有关客户的一些基本数据,如客户ID,年龄,性别,年收入和支出分数。消费分数是您根据定义的参数(如客户行为和购买数据)分配给客户的分数。

问题陈述 您拥有购物中心并希望了解哪些客户可以轻松融合目标客户,以便可以向营销团队提供意见并相应地制定策略

数据集是要根据最后两个特征,来判断是否给会员卡,在生活挺常见的,典型的无监督学习,用k-means他们分类

<code>import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) import os print(os.listdir("../input")) /<code>

<code>['Mall_Customers.csv'] /<code>

<code>import numpy as np import matplotlib.pyplot as plt import pandas as pd import warnings import seaborn as sns from sklearn.preprocessing import LabelEncoder warnings.filterwarnings('ignore') /<code>

<code>data=pd.read_csv('../input/Mall_Customers.csv') data.head() /<code>

<code>X=data.iloc[:,[3,4]].values # 将年度收入和支出分数作为特征 /<code>

求最优聚类数

<code>from sklearn.cluster import KMeans wcss=[] for i in range(1,11): kmeans=KMeans(n_clusters=i,init='k-means++',max_iter=300,n_init=10,random_state=0) kmeans.fit(X) wcss.append(kmeans.inertia_) plt.plot(range(1,11),wcss) plt.title('The Elbow Method') plt.xlabel('Number of clusters') plt.ylabel('WCSS') plt.show() /<code>

看出就是5,因为5是折点

<code>kmeans=KMeans(n_clusters=5,init='k-means++',max_iter=300,n_init=10,random_state=0) y_kmeans=kmeans.fit_predict(X) /<code>

<code>plt.scatter(X[y_kmeans==0,0],X[y_kmeans==0,1],s=100,c='magenta',label='Careful') plt.scatter(X[y_kmeans==1,0],X[y_kmeans==1,1],s=100,c='yellow',label='Standard') plt.scatter(X[y_kmeans==2,0],X[y_kmeans==2,1],s=100,c='green',label='Target') plt.scatter(X[y_kmeans==3,0],X[y_kmeans==3,1],s=100,c='cyan',label='Careless') plt.scatter(X[y_kmeans==4,0],X[y_kmeans==4,1],s=100,c='burlywood',label='Sensible') plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],s=300,c='red',label='Centroids') plt.title('Cluster of Clients') plt.xlabel('Annual Income (k$)') plt.ylabel('Spending Score (1-100)') plt.legend() plt.show /<code>

五个分类

<code>Cluster 1- High income low spending =Careful Cluster 2- Medium income medium spending =Standard Cluster 3- High Income and high spending =Target Cluster 4- Low Income and high spending =Careless Cluster 5- Low Income and low spending =Sensible /<code>

比较男和女

<code>sns.lmplot(x='Age', y='Spending Score (1-100)', data=data,fit_reg=True,hue='Gender') plt.show() /<code>

年龄分布

<code>data.sort_values(['Age']) plt.figure(figsize=(10,8)) plt.bar(data['Age'],data['Spending Score (1-100)']) plt.xlabel('Age') plt.ylabel('Spending Score') plt.show() /<code>

男人和女人花在20多岁和30多岁的时候,因为在以后的阶段,消费变小了。

男变为1,女0

<code>label_encoder=LabelEncoder() integer_encoded=label_encoder.fit_transform(data.iloc[:,1].values) data['Gender']=integer_encoded data.head() /<code>

<code>hm=sns.heatmap(data.iloc[:,1:5].corr(), annot = True, linewidths=.5, cmap='Blues') hm.set_title(label='Heatmap of dataset', fontsize=20) hm plt.ioff() /<code>

看了下其他人的代码,学习一下

有人分成3类

<code>dataset_1 = data.iloc[:,1:5] dataset_1.head(10) /<code>

<code>results = [] for i in range(1,10): kmeans = KMeans(n_clusters=i, init='k-means++') res = kmeans.fit(dataset_1) results.append(res.score(dataset_1)) plt.plot(range(1,10),results) plt.xlabel('Num Clusters') plt.ylabel('score') plt.title('Elbow Curve') /<code>

应该是无关数据影响了

<code>dataset_2 = dataset[:,3:5] dataset_2.head(10) /<code>

<code>results = [] for i in range(1,10): kmeans = KMeans(n_clusters=i, init='k-means++') res = kmeans.fit(dataset_2) results.append(res.score(dataset_2)) plt.plot(range(1,10),results) plt.xlabel('Num Clusters') plt.ylabel('score') plt.title('Elbow Curve') /<code>

数据集链接:

https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python