Support Vector Machines (Conclusion, Appendices, and Bibliography)

Translated by 吳天暉

Conclusion

To conclude, I will quote Stuart Russell and Peter Norvig, who wrote:

You could say that SVMs are successful because of one key insight and one neat trick. (Russell & Norvig, 2010)

The key insight is the fact that some training examples matter more than others: the ones closest to the decision boundary, which we call support vectors. As a result, the optimal hyperplane generalizes better than other hyperplanes and can be constructed from the support vectors alone. One detail we saw is that finding this hyperplane requires solving a convex optimization problem.

The neat trick is the kernel trick. It allows us to use SVMs with non-separable data; without it, SVMs would be of very limited use. We saw that this trick, although it can be hard to grasp at first, is in fact quite simple and can be reused in other learning algorithms; a tiny example of such a kernel function is sketched below.
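To illustrate how little code the trick requires, here is a minimal sketch of a Gaussian (RBF) kernel. A function like this could be passed as the kernel argument of the SmoAlgorithm class listed in Appendix B in place of a linear kernel; the function name and the gamma value are illustrative choices, not something taken from the book.

import numpy as np

def gaussian_kernel(x1, x2, gamma=0.1):
    # k(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    return np.exp(-gamma * np.linalg.norm(np.array(x1) - np.array(x2)) ** 2)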

That said, if you have read this book from cover to cover, you should now understand how SVMs work. Another interesting question is why they work; this is the subject of a field called computational learning theory (SVMs in fact come from statistical learning theory). If you want to know more about it, you can follow the outstanding course Learning from Data (Abu-Mostafa, 2012) (http://work.caltech.edu/telecourse.html), which provides a very good introduction to the subject.

You should also know that SVMs are not only used for classification. One-class SVM can be used for anomaly detection, and Support Vector Regression (SVR) for regression analysis. They are not covered here in order to keep this book succinct, but they are equally interesting topics. Now that you understand the basics of SVMs, you are better prepared to study these variants; a small illustration of both is sketched below.
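Neither topic is part of the book's listings, but as a quick taste, here is a minimal sketch using scikit-learn (a library choice made only for this illustration; the dataset and parameter values are arbitrary assumptions):

import numpy as np
from sklearn.svm import OneClassSVM, SVR

# Anomaly detection: train a one-class SVM on "normal" data only,
# then ask whether new points look like that data (+1) or not (-1).
rng = np.random.RandomState(0)
X_normal = rng.normal(loc=5.0, scale=1.0, size=(100, 2))
detector = OneClassSVM(kernel='rbf', nu=0.05)
detector.fit(X_normal)
print(detector.predict([[5.1, 4.9], [20.0, -3.0]]))  # the second point should be flagged as -1

# Regression: Support Vector Regression learns a real-valued function.
X_reg = np.linspace(0, 10, 50).reshape(-1, 1)
y_reg = np.sin(X_reg).ravel()
regressor = SVR(kernel='rbf', C=10.0, epsilon=0.1)
regressor.fit(X_reg, y_reg)
print(regressor.predict([[1.0], [2.0]]))  # approximately sin(1) and sin(2)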

SVMs will not solve all of your problems, but I hope they are now a tool in your machine learning toolbox, one that you understand and enjoy using.

Appendix A: Datasets

The linearly separable dataset

The code below defines the simple linearly separable dataset used in most chapters of this book. You can find the source code of the other datasets used in the book in the Bitbucket repository (https://bitbucket.org/syncfusiontech/svm-succinctly).


Figure 60: The training set


Figure 61: The test set

The methods in Code Listing 52 are loaded when the module is imported with the code in Code Listing 51.

The method get_training_examples returns the data shown in Figure 60, and the method get_test_examples returns the data shown in Figure 61.

Code Listing 51
from succinctly.datasets import *
Code Listing 52
import numpy as np


def get_training_examples():
    X1 = np.array([[8, 7], [4, 10], [9, 7], [7, 10],
                   [9, 6], [4, 8], [10, 10]])
    y1 = np.ones(len(X1))
    X2 = np.array([[2, 7], [8, 3], [7, 5], [4, 4],
                   [4, 6], [1, 3], [2, 5]])
    y2 = np.ones(len(X2)) * -1
    return X1, y1, X2, y2


def get_test_examples():
    X1 = np.array([[2, 9], [1, 10], [1, 11], [3, 9], [11, 5],
                   [10, 6], [10, 11], [7, 8], [8, 8], [4, 11],
                   [9, 9], [7, 7], [11, 7], [5, 8], [6, 10]])
    X2 = np.array([[11, 2], [11, 3], [1, 7], [5, 5], [6, 4],
                   [9, 4], [2, 6], [9, 3], [7, 4], [7, 2], [4, 5],
                   [3, 6], [1, 6], [2, 3], [1, 1], [4, 2], [4, 3]])
    y1 = np.ones(len(X1))
    y2 = np.ones(len(X2)) * -1
    return X1, y1, X2, y2

Code Listing 53 shows a typical way this code is used. It relies on the get_dataset method from Code Listing 54, which is loaded with the datasets package.

Code Listing 53
from succinctly.datasets import get_dataset, linearly_separable as ls
# Get the training examples of the linearly separable dataset.
X, y = get_dataset(ls.get_training_examples)
Code Listing 54
import numpy as np


def get_dataset(get_examples):
    X1, y1, X2, y2 = get_examples()
    X, y = get_dataset_for(X1, y1, X2, y2)
    return X, y


def get_dataset_for(X1, y1, X2, y2):
    X = np.vstack((X1, X2))
    y = np.hstack((y1, y2))
    return X, y


def get_generated_dataset(get_examples, n):
    X1, y1, X2, y2 = get_examples(n)
    X, y = get_dataset_for(X1, y1, X2, y2)
    return X, y

Appendix B: The SMO Algorithm

Code Listing 55
import numpy as np
from random import randrange


# Written from the pseudo-code in:
# http://luthuli.cs.uiuc.edu/~daf/courses/optimization/Papers/smoTR.pdf
class SmoAlgorithm:
    def __init__(self, X, y, C, tol, kernel, use_linear_optim):
        self.X = X
        self.y = y
        self.m, self.n = np.shape(self.X)
        self.alphas = np.zeros(self.m)

        self.kernel = kernel
        self.C = C
        self.tol = tol

        self.errors = np.zeros(self.m)
        self.eps = 1e-3  # epsilon

        self.b = 0

        self.w = np.zeros(self.n)
        self.use_linear_optim = use_linear_optim

    # Compute the SVM output for example i.
    # Note that Platt uses the convention w.x-b=0
    # while we have been using w.x+b in the book.
    def output(self, i):
        if self.use_linear_optim:
            # Equation 1
            return float(np.dot(self.w.T, self.X[i])) - self.b
        else:
            # Equation 10
            return np.sum([self.alphas[j] * self.y[j]
                           * self.kernel(self.X[j], self.X[i])
                           for j in range(self.m)]) - self.b

    # Try to solve the problem analytically.
    def take_step(self, i1, i2):
        if i1 == i2:
            return False

        a1 = self.alphas[i1]
        y1 = self.y[i1]
        X1 = self.X[i1]
        E1 = self.get_error(i1)

        s = y1 * self.y2

        # Compute the bounds of the new alpha2.
        if y1 != self.y2:
            # Equation 13
            L = max(0, self.a2 - a1)
            H = min(self.C, self.C + self.a2 - a1)
        else:
            # Equation 14
            L = max(0, self.a2 + a1 - self.C)
            H = min(self.C, self.a2 + a1)

        if L == H:
            return False

        k11 = self.kernel(X1, X1)
        k12 = self.kernel(X1, self.X[i2])
        k22 = self.kernel(self.X[i2], self.X[i2])

        # Compute the second derivative of the
        # objective function along the diagonal.
        # Equation 15
        eta = k11 + k22 - 2 * k12

        if eta > 0:
            # Equation 16
            a2_new = self.a2 + self.y2 * (E1 - self.E2) / eta

            # Clip the new alpha so that it stays at the ends of the line.
            # Equation 17
            if a2_new < L:
                a2_new = L
            elif a2_new > H:
                a2_new = H
        else:
            # Under unusual circumstances, eta will not be positive.
            # Equation 19
            f1 = y1 * (E1 + self.b) - a1 * k11 - s * self.a2 * k12
            f2 = self.y2 * (self.E2 + self.b) - s * a1 * k12 \
                 - self.a2 * k22
            L1 = a1 + s * (self.a2 - L)
            H1 = a1 + s * (self.a2 - H)
            Lobj = L1 * f1 + L * f2 + 0.5 * (L1 ** 2) * k11 \
                   + 0.5 * (L ** 2) * k22 + s * L * L1 * k12
            Hobj = H1 * f1 + H * f2 + 0.5 * (H1 ** 2) * k11 \
                   + 0.5 * (H ** 2) * k22 + s * H * H1 * k12
            if Lobj < Hobj - self.eps:
                a2_new = L
            elif Lobj > Hobj + self.eps:
                a2_new = H
            else:
                a2_new = self.a2

        # If alpha2 did not change enough, the algorithm
        # returns without updating the multipliers.
        if abs(a2_new - self.a2) < self.eps * (a2_new + self.a2
                                               + self.eps):
            return False

        # Equation 18
        a1_new = a1 + s * (self.a2 - a2_new)

        new_b = self.compute_b(E1, a1, a1_new, a2_new, k11, k12, k22, y1)

        delta_b = new_b - self.b
        self.b = new_b

        # Equation 22
        if self.use_linear_optim:
            self.w = self.w + y1 * (a1_new - a1) * X1 \
                     + self.y2 * (a2_new - self.a2) * self.X2

        # Update the error cache using the new Lagrange multipliers.
        delta1 = y1 * (a1_new - a1)
        delta2 = self.y2 * (a2_new - self.a2)

        # Update the error cache.
        for i in range(self.m):
            if 0 < self.alphas[i] < self.C:
                self.errors[i] += delta1 * self.kernel(X1, self.X[i]) + \
                                  delta2 * self.kernel(self.X2, self.X[i]) \
                                  - delta_b

        self.errors[i1] = 0
        self.errors[i2] = 0

        self.alphas[i1] = a1_new
        self.alphas[i2] = a2_new

        return True

    def compute_b(self, E1, a1, a1_new, a2_new, k11, k12, k22, y1):
        # Equation 20
        b1 = E1 + y1 * (a1_new - a1) * k11 + \
             self.y2 * (a2_new - self.a2) * k12 + self.b

        # Equation 21
        b2 = self.E2 + y1 * (a1_new - a1) * k12 + \
             self.y2 * (a2_new - self.a2) * k22 + self.b

        if (0 < a1_new) and (self.C > a1_new):
            new_b = b1
        elif (0 < a2_new) and (self.C > a2_new):
            new_b = b2
        else:
            new_b = (b1 + b2) / 2.0

        return new_b

    def get_error(self, i1):
        if 0 < self.alphas[i1] < self.C:
            return self.errors[i1]
        else:
            return self.output(i1) - self.y[i1]

    def second_heuristic(self, non_bound_indices):
        i1 = -1
        if len(non_bound_indices) > 1:
            max_step = 0
            for j in non_bound_indices:
                E1 = self.errors[j] - self.y[j]
                step = abs(E1 - self.E2)  # approximation
                if step > max_step:
                    max_step = step
                    i1 = j
        return i1

    def examine_example(self, i2):
        self.y2 = self.y[i2]
        self.a2 = self.alphas[i2]
        self.X2 = self.X[i2]
        self.E2 = self.get_error(i2)

        r2 = self.E2 * self.y2

        if not ((r2 < -self.tol and self.a2 < self.C) or
                (r2 > self.tol and self.a2 > 0)):
            # The KKT conditions are met; SMO looks at another example.
            return 0

        # Second heuristic A: choose the Lagrange multiplier which
        # maximizes the absolute error.
        non_bound_idx = list(self.get_non_bound_indexes())
        i1 = self.second_heuristic(non_bound_idx)

        if i1 >= 0 and self.take_step(i1, i2):
            return 1

        # Second heuristic B: look for examples making positive
        # progress by looping over all non-zero and non-C alphas,
        # starting at a random point.
        if len(non_bound_idx) > 0:
            rand_i = randrange(len(non_bound_idx))
            for i1 in non_bound_idx[rand_i:] + non_bound_idx[:rand_i]:
                if self.take_step(i1, i2):
                    return 1

        # Second heuristic C: look for examples making positive progress
        # by looping over all possible examples, starting at a random
        # point.
        rand_i = randrange(self.m)
        all_indices = list(range(self.m))
        for i1 in all_indices[rand_i:] + all_indices[:rand_i]:
            if self.take_step(i1, i2):
                return 1

        # Extremely degenerate circumstances: SMO skips the first example.
        return 0

    def error(self, i2):
        return self.output(i2) - self.y2

    def get_non_bound_indexes(self):
        return np.where(np.logical_and(self.alphas > 0,
                                       self.alphas < self.C))[0]

    # First heuristic: loop over examples where alpha is not 0 and not C;
    # they are the most likely to violate the KKT conditions
    # (the non-bound subset).
    def first_heuristic(self):
        num_changed = 0
        non_bound_idx = self.get_non_bound_indexes()
        for i in non_bound_idx:
            num_changed += self.examine_example(i)
        return num_changed
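Note that Code Listing 56 below calls smo.main_routine(), a method that does not appear in the listing above. The following is a sketch of that missing outer loop, reconstructed from Platt's pseudo-code (the report linked at the top of Code Listing 55); it belongs inside the SmoAlgorithm class and may differ slightly from the author's original listing.

    # Sketch (assumed): alternate between full passes over all examples
    # and passes over the non-bound subset until no multiplier changes.
    def main_routine(self):
        num_changed = 0
        examine_all = True
        while num_changed > 0 or examine_all:
            num_changed = 0
            if examine_all:
                # Full pass over every training example.
                for i in range(self.m):
                    num_changed += self.examine_example(i)
            else:
                # Pass over the non-bound examples only (first heuristic).
                num_changed += self.first_heuristic()
            if examine_all:
                examine_all = False
            elif num_changed == 0:
                examine_all = True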

Code Listing 56 demonstrates how to instantiate a SmoAlgorithm object, run the algorithm, and print the result.

Code Listing 56
import numpy as np
from random import seed
from succinctly.datasets import linearly_separable, get_dataset
from succinctly.algorithms.smo_algorithm import SmoAlgorithm


def linear_kernel(x1, x2):
    return np.dot(x1, x2)


def compute_w(multipliers, X, y):
    # Sum the weighted examples along axis 0 to obtain the vector w.
    return np.sum([multipliers[i] * y[i] * X[i] for i in range(len(y))],
                  axis=0)


if __name__ == '__main__':
    seed(5)  # to have reproducible results

    X_data, y_data = get_dataset(linearly_separable.get_training_examples)

    smo = SmoAlgorithm(X_data, y_data, C=10, tol=0.001,
                       kernel=linear_kernel, use_linear_optim=True)

    smo.main_routine()

    w = compute_w(smo.alphas, X_data, y_data)

    print('w = {}'.format(w))

    # -smo.b because Platt uses the convention w.x-b=0
    print('b = {}'.format(-smo.b))

    # w = [0.4443664 1.1105648]
    # b = -9.66268641132
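As a quick sanity check (not part of the original listings), the returned w and b can be used to classify a new point with the book's hypothesis h(x) = sign(w · x + b); the helper below and the sample point are purely illustrative.

import numpy as np


def predict(x, w, b):
    # Book convention: h(x) = sign(w . x + b)
    return int(np.sign(np.dot(w, x) + b))


# [8, 7] belongs to the positive training class, so this should print 1.
print(predict(np.array([8, 7]), np.array([0.4443664, 1.1105648]), -9.66268641132))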

Bibliography

Abu-Mostafa, Y. S. (2012). Learning From Data. AMLBook.

Biernat, E., & Lutz, M. (2016). Data science: fondamentaux et études de cas. Eyrolles.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Burges, C. J. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 121-167.

Crammer, K., & Singer, Y. (2001). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research 2.

Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press.

Dogan, U., Glasmachers, T., & Igel, C. (2011). Fast Training of Multi-Class Support Vector Machines.

El Ghaoui, L. (2015). Optimization Models and Applications. Retrieved from http://livebooklabs.com/keeppies/c5a5868ce26b8125

Gershwin, S. B. (2010). KKT Examples. Retrieved from MIT Mechanical Engineering Course: http://ocw.mit.edu/courses/mechanical-engineering/2-854-introduction-to-manufacturing-systems-fall-2010/lecture-notes/MIT2_854F10_kkt_ex.pdf

Gretton, A. (2016, 03 05). Lecture 9: Support Vector Machines. Retrieved from http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/Slides5A.pdf

Han, X., & Berg, A. C. (2012). DCMSVM: Distributed Parallel Training For Single-Machine Multiclass Classifiers.

Hsu, C.-W., & Lin, C.-J. (2002). A Comparison of Methods for Multi-class Support Vector Machines. IEEE transactions on neural networks.

Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2016, 10 02). A Practical Guide to Support Vector Classification. Retrieved from LIBSVM website: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

Ng, A. (n.d.). CS229 Lecture notes - Part V Support Vector Machines. Retrieved from http://cs229.stanford.edu/notes/cs229-notes3.pdf

Osuna, E., Freund, R., & Girosi, F. (1997). An Improved Training Algorithm for Support Vector Machines. Proceedings of IEEE NNSP'97.

Platt, J. C. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research.

Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. MIT Press.

Rojas, R. (1996). Neural Networks: A Systematic Introduction. Springer.

Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Pearson.

Tyson Smith, B. (2004). Lagrange Multipliers Tutorial in the Context of Support Vector Machines. Newfoundland.

Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.

Weston, J., & Watkins, C. (1999). Support Vector Machines for Multi-Class Pattern Recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks.

