Decision Trees

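Decision trees are among the most popular and powerful supervised learning algorithms. Unlike some supervised learning algorithms that handle only one kind of task, decision trees can be used for both classification and regression problems. They serve as predictive models in machine learning, statistics, and data mining. A decision tree is a tree structure in which each internal node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome.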
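To make the node, branch, and leaf vocabulary concrete, here is a minimal sketch of a hand-written tree stored as nested dictionaries. The feature names, thresholds, and labels are invented for illustration and are not learned from any data.

```python
# A toy tree written by hand; features, thresholds and labels are hypothetical.
toy_tree = {
    "feature": "energy",            # internal node: tests one feature
    "threshold": 0.6,
    "left": {"leaf": "dislike"},    # branch taken when energy <= 0.6
    "right": {                      # branch taken when energy > 0.6
        "feature": "danceability",
        "threshold": 0.5,
        "left": {"leaf": "dislike"},
        "right": {"leaf": "like"},  # leaf: the predicted outcome
    },
}


def predict(node, sample):
    """Follow branches from the root until a leaf is reached."""
    while "leaf" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


print(predict(toy_tree, {"energy": 0.8, "danceability": 0.7}))  # like
print(predict(toy_tree, {"energy": 0.3, "danceability": 0.9}))  # dislike
```

Each prediction is just a path of rules from the root to a leaf, which is why the result is easy to explain.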

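Decision trees are widely used because they mimic the way people reason through a decision, which makes the model's logic and its predictions easy to follow and explain.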

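They need relatively little data preprocessing: features do not have to be scaled or otherwise transformed, and because a split depends only on the ordering of feature values, the trees are fairly robust to outliers in the features. Some implementations, scikit-learn among them, still expect categorical features to be encoded as numbers.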
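A rough way to see why rescaling is unnecessary: a threshold split only depends on how the samples are ordered along a feature. The sketch below searches for the best single threshold on one feature and shows that the resulting partition is the same after a unit change or after one value becomes an extreme outlier. The helper names `gini` and `best_partition`, and the durations and labels, are made up for this illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(v) / n) ** 2 for v in set(labels))


def best_partition(values, labels):
    """Indices sent to the left child by the best single threshold."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    for cut in range(1, len(order)):
        # A threshold cannot separate two equal values.
        if values[order[cut - 1]] == values[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical song durations in milliseconds, with made-up labels.
raw = [240000, 180000, 420000, 200000, 300000, 210000]
labels = [1, 0, 1, 0, 1, 0]

minutes = [v / 60000 for v in raw]   # same feature in different units
with_outlier = list(raw)
with_outlier[2] = 4200000            # the largest value becomes an outlier

print(best_partition(raw, labels) == best_partition(minutes, labels))
print(best_partition(raw, labels) == best_partition(with_outlier, labels))
```

Both comparisons print True here because neither change alters the order of the samples along the feature.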

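There are several decision tree algorithms:

1. ID3 (uses entropy and information gain as its splitting metric)

2. C4.5 (uses gain ratio as its splitting metric)

3. CART (uses the Gini index as its splitting metric)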
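The metrics used by ID3 and C4.5 can be sketched in a few lines. This is a minimal illustration of the textbook definitions of entropy, information gain, and gain ratio, not the code of any particular library. The function names and the example labels are assumptions made for the sketch.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain divided by the entropy of the split sizes."""
    n = len(parent)
    split_info = -sum(len(ch) / n * log2(len(ch) / n) for ch in children if ch)
    return information_gain(parent, children) / split_info if split_info > 0 else 0.0


# Made-up example: 8 songs, split two different ways.
parent = ["like"] * 4 + ["dislike"] * 4
weak_split = [["like", "like", "like", "dislike"],
              ["like", "dislike", "dislike", "dislike"]]
perfect_split = [["like"] * 4, ["dislike"] * 4]

print(round(information_gain(parent, weak_split), 4))
print(round(information_gain(parent, perfect_split), 4))
print(round(gain_ratio(parent, weak_split), 4))
```

A split that separates the classes more cleanly yields a larger information gain. Gain ratio additionally penalizes splits that break the data into many small branches.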

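The Classification and Regression Trees (CART) algorithm builds a decision tree for classification using the Gini impurity index as its splitting criterion. The Gini index measures how often a randomly chosen element would be misclassified if it were labeled at random according to the class distribution in the node. scikit-learn exposes this as criterion="gini", which is the default for DecisionTreeClassifier. scikit-learn uses an optimized version of the CART algorithm.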
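As a small worked example of the Gini index, the sketch below computes the impurity of a node and the size-weighted impurity of a candidate split. It follows the standard formula (one minus the sum of squared class proportions). The function names and the label lists are illustrative assumptions, and this is not scikit-learn's internal code.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k^2) over the class proportions p_k in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted average impurity of the child nodes of a split."""
    n = sum(len(ch) for ch in children)
    return sum(len(ch) / n * gini_impurity(ch) for ch in children)


print(gini_impurity([1, 1, 0, 0]))  # 0.5, an evenly mixed node
print(gini_impurity([1, 1, 1, 1]))  # 0.0, a pure node
print(gini_impurity([1, 1, 1, 0]))  # 0.375
print(weighted_gini([[1, 1, 1, 0], [0, 0, 0, 1]]))  # 0.375
```

When growing the tree, CART picks the split whose children have the lowest weighted impurity.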

Implementing a decision tree

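This example uses the Spotify classification dataset from Kaggle (https://www.kaggle.com/geomack/spotifyclassification). Kaggle is a platform for predictive modeling and analytics competitions, in which statisticians and data miners compete to produce the best models for predicting and describing datasets uploaded by companies and users.

The Spotify Song Attributes dataset contains 2017 songs and their attributes. The dataset's creator generated it from Spotify's API.

The dataset has 16 columns: the song title, the artist name, a target column used as the label, and 13 track attributes: acousticness, danceability, duration, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time signature, and valence.

The Python code for the decision tree follows.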

# Import the necessary Python machine learning libraries
import io

import imageio
import pandas as pd
import pydotplus
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Import the data set
data = pd.read_csv("data.csv")
# Split the dataset into a training set and a test set
train, test = train_test_split(data, test_size=0.15)
# Require at least 100 samples in a node before it may be split
classifier = DecisionTreeClassifier(min_samples_split=100)
# Feature columns used as model inputs
features = ["danceability", "loudness", "valence", "energy", "instrumentalness",
            "acousticness", "key", "speechiness", "duration_ms", "liveness",
            "mode", "tempo"]
# Build the training and test inputs and labels
X_train = train[features]
y_train = train["target"]
X_test = test[features]
y_test = test["target"]
# Fit the decision tree
dt = classifier.fit(X_train, y_train)
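# Visualize the decision tree
def show_tree(tree, features, path):
    # Export the fitted tree in Graphviz DOT format to an in-memory buffer
    f = io.StringIO()
    export_graphviz(tree, out_file=f, feature_names=features)
    # Render the DOT source to a PNG file (needs the Graphviz binaries installed)
    pydotplus.graph_from_dot_data(f.getvalue()).write_png(path)
    # Load the rendered image and display it
    img = imageio.imread(path)
    plt.rcParams["figure.figsize"] = (20, 20)
    plt.imshow(img)
    plt.show()

show_tree(dt, features, "Decision_Tree_Spotify_Songs.png")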
# Predict on the test set and report accuracy as a percentage
y_pred = classifier.predict(X_test)
score = accuracy_score(y_test, y_pred) * 100
print("Accuracy using the Decision Tree:", round(score, 2), "%")
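Because the train/test split above is random, the reported accuracy changes from run to run. The sketch below shows one possible way to make a run repeatable with random_state, and to inspect a fitted tree as plain text with scikit-learn's export_text when Graphviz is not installed. The two-feature dataset is synthetic and stands in for the Spotify data. Its values, the max_depth setting, and the seed are assumptions chosen only for illustration.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: 40 rows with two made-up features.
feature_names = ["energy", "valence"]
X = [[i / 40, ((i * 7) % 40) / 40] for i in range(40)]
y = [1 if row[0] > 0.5 else 0 for row in X]

# A fixed random_state makes the split, and so the accuracy, repeatable.
# stratify=y keeps the class balance similar in the two parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# export_text renders the fitted tree as indented text rules.
rules = export_text(model, feature_names=feature_names)
print(rules)

score = accuracy_score(y_test, model.predict(X_test)) * 100
print("Accuracy on the synthetic test set:", round(score, 2), "%")
```

The same two ideas could be applied to the Spotify code by passing random_state to train_test_split and calling export_text(dt, feature_names=features).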