Quickly Load the 70,000-Image Dataset with Tensorflow!

Author | 郭俊麟

1. Brief

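In this article we use the well-known image database "THE MNIST DATABASE" as our image source. It contains 70,000 pictures of handwritten digits, each 28×28 pixels.

The pictures are split into a training set of 60,000 and a test set of 10,000, and 5,000 pictures of the training set are set aside for validation. The database is widely regarded as the "Hello World" of image processing, and countless studies have already been built around this dataset.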


  1. .jpeg: height, width, channels;
  2. .png: height, width, channels, alpha.

(Note: pictures stored as .png carry transparency information, which can be discarded when processing the picture.)
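Discarding the transparency information can be sketched as follows. This is a minimal illustration with NumPy; the 2×2 RGBA array is a made-up stand-in for a decoded .png, not data from MNIST.

```python
import numpy as np

# Hypothetical 2x2 RGBA picture: (height, width, 4 channels), values 0-255.
rgba = np.array(
    [[[255, 0, 0, 255], [0, 255, 0, 128]],
     [[0, 0, 255, 0], [255, 255, 255, 255]]],
    dtype=np.uint8,
)

# Keep the first three channels and drop the alpha channel.
rgb = rgba[..., :3]

print(rgba.shape)  # (2, 2, 4)
print(rgb.shape)   # (2, 2, 3)
```

The same slice works for an array returned by cv2.imread with the cv2.IMREAD_UNCHANGED flag, where the channel order is BGRA instead of RGBA.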

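Once these pictures are imported into Python with a module such as opencv, they are represented as arrays of numbers, and every call of the form image = cv2.imread() decodes a file and binds the resulting data to an image object.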



Such a workflow clearly has no advantage when moving or reading data, so we need to bring the data back to its most basic form: binary.

2. Binary Data

Reasons for using binary data



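In this analogy, the entrance of the water slide stands for the GPU that performs the deep learning computation, and the people waiting to slide down stand for the data. What we want to optimize is how to have the next batch of preprocessed data ready before the GPU has finished processing the current one.

Keeping the GPU busy at all times further improves the efficiency of the whole computation, and one way to do that is to bring the data back to its binary form.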
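The idea of preparing the next batch while the current one is still being processed can be sketched with a background thread and a bounded queue. This is only an illustration using the Python standard library: prefetch and load_batches are hypothetical helper names, and the "batches" are small lists rather than real image data.

```python
import queue
import threading

def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for batch in batches:
            q.put(batch)   # blocks while the buffer is full
        q.put(sentinel)    # signals that the data is exhausted

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

def load_batches(n):
    # Stand-in for reading and preprocessing data from disk.
    for i in range(n):
        yield [i, i + 1, i + 2]

# The consumer (the "GPU" in the analogy) only ever waits on the queue.
results = [sum(batch) for batch in prefetch(load_batches(4))]
print(results)  # [3, 6, 9, 12]
```

The bounded queue keeps memory use under control: the producer pauses as soon as buffer_size batches are waiting.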



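The introductory database I chose, MNIST, has kindly taken care of the preprocessing for us. It comes as four files:

  • Test set images: t10k-images-idx3-ubyte.gz;
  • Test set labels: t10k-labels-idx1-ubyte.gz;
  • Training set images: train-images-idx3-ubyte.gz;
  • Training set labels: train-labels-idx1-ubyte.gz.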


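3. The approach to load images

Now that we know the database stores binary data, we can parse it with Python modules. The files are compressed as .gz, and the module that opens this file type is gzip. The code is as follows: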

import gzip, os
import numpy as np

location = input('The directory of MNIST dataset: ')

# First half: load the 60,000 training images.
path = os.path.join(location, 'train-images-idx3-ubyte.gz')
try:
    with gzip.open(path, 'rb') as fi:
        # Pixels are unsigned bytes; the first 16 bytes are the file header.
        data_i = np.frombuffer(fi.read(), dtype=np.uint8, offset=16)
    images_flat_all = data_i.reshape(-1, 784)
    print(images_flat_all)
    print('----- Separation -----')
    print('Size of images_flat: ', len(images_flat_all))
except FileNotFoundError:
    print("The file directory doesn't exist!")

### ----- Result is shown below ----- ###
The directory of MNIST dataset: /home/abc/MNIST_data
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
----- Separation -----
Size of images_flat: 60000

# Second half: load the 60,000 training labels.
path_label = os.path.join(location, 'train-labels-idx1-ubyte.gz')
with gzip.open(path_label, 'rb') as fl:
    # The label file header is only 8 bytes long.
    data_l = np.frombuffer(fl.read(), dtype=np.uint8, offset=8)
print(data_l)
print('----- Separation -----')
print('Size of images_labels: ', len(data_l), type(data_l[0]))

### ----- Result is shown below ----- ###
[5 0 4 ... 5 6 8]
----- Separation -----
Size of images_labels: 60000 <class 'numpy.uint8'>

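The code has an upper and a lower half. The upper half extracts the 60,000 image samples of the training set in the MNIST DATASET. Each sample is a 28×28 picture flattened into a vector of length 1×784.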
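The relationship between a 28×28 picture and its 784-element vector can be checked with a small NumPy sketch. The pixel values below are synthetic stand-ins for three pictures, not real MNIST data.

```python
import numpy as np

# Stand-in for the raw pixel stream of 3 pictures (3 * 784 bytes).
raw = (np.arange(3 * 784) % 256).astype(np.uint8)

flat = raw.reshape(-1, 784)        # one row per picture
images = flat.reshape(-1, 28, 28)  # back to 2-D pictures

print(flat.shape)    # (3, 784)
print(images.shape)  # (3, 28, 28)

# Pixel (row 1, column 0) of a picture is element 28 of its flat vector.
same_pixel = bool(images[0, 1, 0] == flat[0, 28])
print(same_pixel)    # True
```

Flattening is therefore lossless: reshaping back to (28, 28) recovers the picture, which is what a plotting routine needs.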


4. Explanation to the code



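  • gzip.open with the 'rb' mode opens the given compressed file for reading in binary mode;
  • .frombuffer converts the data into an np.array;
  • dtype turns the raw binary data into 8-bit unsigned integers (0 to 255) that a human can read. Pixel values go up to 255, so the type should be np.uint8 rather than np.int8;
  • In the raw MNIST image files the pixel data starts only at byte 16, and in the label files the labels start at byte 8, so offset is set to 16 or 8 respectively to skip the file header;
  • The data read out is one long strip made of 60,000 vectors joined end to end, so it has to be reshaped. The -1 in .reshape(-1, 784) acts like an unknown: as long as the number of columns is 784, the number of rows is whatever it works out to be;
  • After the matching labels are extracted, they still need a one_hot() conversion so that, for example, [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] stands for "3". This makes the labels easy to plug into the loss function when searching for the optimum.

Describing the data with numpy arrays has the benefit of efficient processing, and the library is compatible with most other data processing libraries, which is a big advantage in both convenience and efficiency.

The two links that follow, "numpy.frombuffer" and "Using dynamic arrays in NumPy", explain the usage of the functions in more depth.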
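Where the offsets of 16 and 8 come from can be illustrated by reading the header explicitly. The sketch below builds a tiny in-memory file laid out like an IDX image file (four big-endian 32-bit integers: magic number 2051, picture count, rows, columns, then the pixels) and parses it again. build_fake_idx_images is a hypothetical helper written for this illustration, and the 2×3 "pictures" are made up.

```python
import gzip
import io
import struct
import numpy as np

def build_fake_idx_images(count, rows, cols):
    """Build a tiny gzip-compressed blob laid out like an IDX image file."""
    header = struct.pack('>IIII', 2051, count, rows, cols)  # big-endian
    pixels = bytes(range(count * rows * cols))              # fake pixel values
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
        gz.write(header + pixels)
    return buf.getvalue()

blob = build_fake_idx_images(count=2, rows=2, cols=3)

with gzip.open(io.BytesIO(blob), 'rb') as f:
    content = f.read()

# The first 16 bytes are the header, which is why offset=16 skips it.
magic, count, rows, cols = struct.unpack('>IIII', content[:16])
pixels = np.frombuffer(content, dtype=np.uint8, offset=16)
pixels = pixels.reshape(count, rows * cols)

print(magic, count, rows, cols)  # 2051 2 2 3
print(pixels.shape)              # (2, 6)
```

A label file carries only two header integers (magic number 2049 and the item count), which gives the offset of 8.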

5. Linear Model



import numpy as np
import tensorflow as tf

# Ground truth: y = 0.1 * x + 0.3
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.1 + 0.3

# Trainable parameters of the model y = weight * x + bias
weight = tf.Variable(tf.random_uniform(shape=[1], minval=-1.0, maxval=1.0))
bias = tf.Variable(tf.zeros(shape=[1]))
y = weight * x_data + bias

# Mean squared error between the prediction and the ground truth
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
training = optimizer.minimize(loss)

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)  # variables must be initialized before training

for step in range(101):
    sess.run(training)  # one gradient descent update
    if step % 10 == 0:
        print('Round {}, weight: {}, bias: {}'
              .format(step, sess.run(weight[0]), sess.run(bias[0])))

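Here we can see that each x has two corresponding y values. y_data is the ground truth we defined in advance, while the other y, computed as wx + b, is the unknown solution that we fit to the ground truth. These two different y values for the same x are then fed into a chosen loss function.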
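What the optimizer does with those two y values can be written out by hand. The following is a rough NumPy sketch of gradient descent on the mean squared error, not the Tensorflow implementation; the zero starting point, the fixed random seed, and the 2,000 steps are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.random(100)
y_data = x_data * 0.1 + 0.3   # ground truth: weight 0.1, bias 0.3

weight, bias = 0.0, 0.0       # assumed starting point
learning_rate = 0.5

for step in range(2000):
    y = weight * x_data + bias          # prediction wx + b
    error = y - y_data
    loss = np.mean(error ** 2)          # mean squared error
    # Partial derivatives of the loss with respect to weight and bias
    grad_w = 2.0 * np.mean(error * x_data)
    grad_b = 2.0 * np.mean(error)
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

# The fitted values end up very close to 0.1 and 0.3.
print(weight, bias, loss)
```

Each update moves weight and bias a small step against the gradient, so the predicted y drifts toward y_data until the loss is practically zero.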




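One point where image data looks different from the example above: the computation for every pixel is folded into one large matrix and handled in parallel as a small unit of the overall operation, which greatly speeds up the whole computation.


6. MNIST in Linear Model

Having gone through the linear model and the components of the MNIST dataset, the next step is to build a linear-model handwritten digit recognizer on Tensorflow. A few points need to be restated: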

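  1. batch size: the number of training pictures per batch has to be controlled so that memory does not run out;
  2. loss function: the loss function measures the gap between the prediction and the actual answer.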
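For a classifier with ten classes, the gap between prediction and answer is commonly measured with softmax cross-entropy. The sketch below is a plain NumPy version written for illustration, under the assumption that the labels are already one-hot encoded; softmax_cross_entropy is a hypothetical helper, not the Tensorflow function, and the three-class numbers are made up.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_labels):
    """Per-sample cross-entropy between softmax(logits) and one-hot labels."""
    # Subtracting the row maximum keeps np.exp from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1)

# A batch of 2 samples with 3 classes.
logits = np.array([[2.0, 0.0, 0.0],    # fairly confident in class 0
                   [0.0, 0.0, 0.0]])   # no preference at all
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

losses = softmax_cross_entropy(logits, labels)
batch_loss = losses.mean()   # one averaged number for the whole batch

print(losses)
print(batch_loss)
```

The first sample, where the correct class has the largest score, gets the smaller loss; the second, with uniform scores, gets log(3). Averaging over the batch yields the single value that the optimizer minimizes.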


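  1. We need a simple and convenient way to call up the MNIST data we want, so we write a class;
  2. Build the Tensorflow dataflow graph, using nodes to design a linear operation wx + b;
  3. Feed the result and the actual labels into a loss function to obtain the loss value;
  4. Use gradient descent to find the minimum of the loss value;
  5. After iterative training, check the accuracy of the result;
  6. Check which labels the misclassified pictures were assigned to.
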
import gzip, os
import numpy as np

################ Step No.1 to well manage the dataset. ################
class MNIST:
    # Image size is given on the official website: 28*28 px.
    image_size = 28
    image_size_flat = image_size * image_size

    # Keep the validation set size flexible when making an instance.
    def __init__(self, val_ratio=0.1, data_dir='MNIST_data'):
        self.val_ratio = val_ratio
        self.data_dir = data_dir
        # Load the 4 files into arrays of flattened pixels / labels.
        img_train = self.load_flat_images('train-images-idx3-ubyte.gz')
        lab_train = self.load_labels('train-labels-idx1-ubyte.gz')
        img_test = self.load_flat_images('t10k-images-idx3-ubyte.gz')
        lab_test = self.load_labels('t10k-labels-idx1-ubyte.gz')
        # Determine the actual number of training / validation samples.
        self.val_train_num = round(len(img_train) * self.val_ratio)
        self.main_train_num = len(img_train) - self.val_train_num
        # Normalized pixel values (0 to 1) are more convenient for training.
        # dtype=np.int64 is more general when applying to Tensorflow.
        # (np.int was removed from recent NumPy releases.)
        self.img_train = img_train[0:self.main_train_num] / 255.0
        self.lab_train = lab_train[0:self.main_train_num].astype(np.int64)
        self.img_train_val = img_train[self.main_train_num:] / 255.0
        self.lab_train_val = lab_train[self.main_train_num:].astype(np.int64)
        # Also convert the format of the testing set.
        self.img_test = img_test / 255.0
        self.lab_test = lab_test.astype(np.int64)

    # Shared code extracted from "load_flat_images" and "load_labels".
    # This method won't be called during the training procedure.
    def load_binary_to_num(self, dataset_name, offset):
        path = os.path.join(self.data_dir, dataset_name)
        with gzip.open(path, 'rb') as binary_file:
            # The dataset files store 8-bit values, mind the format.
            data = np.frombuffer(binary_file.read(), np.uint8, offset=offset)
        return data

    # This method won't be called during the training procedure.
    def load_flat_images(self, dataset_name):
        # Image data starts at byte offset 16 in the file format.
        data = self.load_binary_to_num(dataset_name, offset=16)
        images_flat_all = data.reshape(-1, self.image_size_flat)
        return images_flat_all

    # This method won't be called during the training procedure.
    def load_labels(self, dataset_name):
        # Label data starts at byte offset 8 in the file format.
        labels_all = self.load_binary_to_num(dataset_name, offset=8)
        return labels_all

    # This method is called for training usage.
    def one_hot(self, labels):
        # Use numpy to mimic the one-hot effect.
        class_num = np.max(self.lab_test) + 1
        convert = np.eye(class_num, dtype=float)[labels]
        return convert

path = '/home/abc/MNIST_data'
data = MNIST(val_ratio=0.1, data_dir=path)
import tensorflow as tf

flat_size = data.image_size_flat
label_num = np.max(data.lab_test) + 1

################ Step No.2 to construct tensor graph. ################
x_train = tf.placeholder(dtype=tf.float32, shape=[None, flat_size])
t_label_oh = tf.placeholder(dtype=tf.float32, shape=[None, label_num])
t_label = tf.placeholder(dtype=tf.int64, shape=[None])

################ These are the values ################
# Initialize the beginning weights and biases by random_normal method.
weights = tf.Variable(tf.random_normal([flat_size, label_num],
                                       mean=0.0, stddev=1.0))
biases = tf.Variable(tf.random_normal([label_num], mean=0.0, stddev=1.0))
########### that we wish to get by training ##########

logits = tf.matmul(x_train, weights) + biases  # < Annotation No.1 >
# Squash the values into the range 0 to 1 with the softmax formula.
p_label_soh = tf.nn.softmax(logits)
# Pick the position of the largest value in each row (axis=1).
p_label = tf.argmax(p_label_soh, axis=1)

####### Step No.3 to get a loss value by certain loss function. #######
# This function expects raw logits, not values that were already "softmaxed".
CE = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=t_label_oh)
# Reduce all per-sample loss values to a single averaged loss.
loss = tf.reduce_mean(CE)

#### Step No.4 get a minimized loss value using gradient descent. ####
# Drive the averaged loss toward a minimum with a gradient-based optimizer.
optimizer = tf.train.AdamOptimizer(learning_rate=0.5).minimize(loss)

# tf.equal first returns a list of boolean values,
correct_predict = tf.equal(p_label, t_label)
# which are cast to 0 and 1 so that their average is the accuracy.
accuracy = tf.reduce_mean(tf.cast(correct_predict, dtype=tf.float32))

sess = tf.Session()
# Variables must be initialized before the graph can be evaluated.
sess.run(tf.global_variables_initializer())
###### Step No.5 iterate the training set and check the accuracy. #####
# The trigger to train the linear model for a defined number of cycles.
def optimize(iteration, batch_size=32):
    total = len(data.lab_train)
    for i in range(iteration):
        # Randomly pick training images / labels with a defined batch size.
        random = np.random.randint(0, total, size=batch_size)
        x_train_batch = data.img_train[random]
        t_label_batch_oh = data.one_hot(data.lab_train[random])
        batch_dict = {
            x_train: x_train_batch,
            t_label_oh: t_label_batch_oh
        }
        sess.run(optimizer, feed_dict=batch_dict)

# The trigger to check the current accuracy value
def Accuracy():
    # Use the totally separate test set to evaluate the trained model
    test_dict = {
        x_train: data.img_test,
        t_label_oh: data.one_hot(data.lab_test),
        t_label: data.lab_test
    }
    Acc = sess.run(accuracy, feed_dict=test_dict)
    print('Accuracy on Test Set: {0:.2%}'.format(Acc))

### Step No.6 plot wrong predicted pictures with its predicted label.##
import matplotlib.pyplot as plt

# We can decide how many wrongly predicted images are shown,
# and we can focus on one specific true label.
def wrong_predicted_images(pic_num=[3, 4], label_number=None):
    test_dict = {
        x_train: data.img_test,
        t_label_oh: data.one_hot(data.lab_test),
        t_label: data.lab_test
    }
    correct_pred, p_lab = sess.run([correct_predict, p_label],
                                   feed_dict=test_dict)
    # Reverse the boolean values in order to pick up the wrong labels
    wrong_pred = (correct_pred == False)
    # Pick up the wrongly predicted elements from the corresponding places
    wrong_img_test = data.img_test[wrong_pred]
    wrong_t_label = data.lab_test[wrong_pred]
    wrong_p_label = p_lab[wrong_pred]

    fig, axes = plt.subplots(pic_num[0], pic_num[1])
    fig.subplots_adjust(hspace=0.3, wspace=0.3)
    edge = data.image_size
    for ax in axes.flat:
        if label_number is None:
            # If we are not interested in a certain label number,
            # pick up the wrongly predicted images randomly.
            i = np.random.randint(0, len(wrong_t_label),
                                  size=None, dtype=np.int64)
            pic = wrong_img_test[i].reshape(edge, edge)
            ax.imshow(pic, cmap='binary')
            xlabel = "True: {0}, Pred: {1}".format(wrong_t_label[i],
                                                   wrong_p_label[i])
        else:
            # If we are interested in a certain label number,
            # pick up wrong images of that label randomly.
            # Mind that np.where returns a "tuple" that must be indexed.
            specific_idx = np.where(wrong_t_label == label_number)[0]
            i = np.random.randint(0, len(specific_idx),
                                  size=None, dtype=np.int64)
            pic = wrong_img_test[specific_idx[i]].reshape(edge, edge)
            ax.imshow(pic, cmap='binary')
            xlabel = "True: {0}, Pred: {1}".format(wrong_t_label[specific_idx[i]],
                                                   wrong_p_label[specific_idx[i]])
        ax.set_xlabel(xlabel)
        # Pictures don't need any ticks, so we remove them in both dimensions
        ax.set_xticks([])
        ax.set_yticks([])
    plt.show()

Accuracy() # Accuracy before doing anything
optimize(10); Accuracy() # Iterate 10 times
optimize(1000); Accuracy() # Iterate 10 + 1000 times
optimize(10000); Accuracy() # Iterate 10 + 1000 + 10000 times
### ----- Results are shown below ----- ###
Accuracy on Test Set: 11.51%
Accuracy on Test Set: 68.37%
Accuracy on Test Set: 86.38%
Accuracy on Test Set: 89.34%

Annotation No.1 tf.matmul(x_train, weights)

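Once the overall principle of neural network training is understood, this step is the most important sub-topic. The matrix computation has to let random_batch draw an arbitrary number of samples while still obeying the rules of matrix multiplication, as the figure below shows: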

[Figure: matrix shapes in tf.matmul(x_train, weights)]

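The order of the matrices matters. After our processing, the dataset already has the shape of the left-hand matrix, and since we expect the output to have the shape of the right-hand matrix, the only possible order is x·w. The arbitrary number of rows in x determines the number of rows of predicted labels, while w determines how many class labels there are.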
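The shape bookkeeping can be verified with NumPy alone. This is a small sketch with random numbers standing in for a batch of pictures and for the trained parameters; the batch size of 32 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 32                      # any number of rows works
x = rng.random((batch_size, 784))    # one flattened picture per row
w = rng.normal(size=(784, 10))       # one column per class label
b = rng.normal(size=10)

logits = x @ w + b                   # b is broadcast over the rows
predicted = logits.argmax(axis=1)    # most likely class for each picture

print(logits.shape)     # (32, 10)
print(predicted.shape)  # (32,)

# The reverse order does not satisfy the rules of matrix multiplication.
order_matters = False
try:
    w @ x
except ValueError:
    order_matters = True
print(order_matters)    # True
```

The inner dimensions (784 and 784) must match, and the outer ones (batch size and 10) become the shape of the result.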

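Reason for using one_hot()

As the figure above shows, the result of the linear operation on the dataset can only have size=[None, 10], but the label answers that come with the dataset are the digits themselves. We therefore need a way to convert each digit into a vector of 10 elements. The first choice is one_hot(), and the one_hot result is then used to compute the loss function.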
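The conversion itself takes one line with np.eye, the same trick the one_hot method of the class relies on. The four labels below are a made-up example.

```python
import numpy as np

labels = np.array([5, 0, 4, 3])              # digit labels
one_hot = np.eye(10, dtype=float)[labels]    # row i of the identity matrix encodes digit i

print(one_hot.shape)  # (4, 10)
print(one_hot[3])     # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]

# argmax turns the vectors back into digits.
recovered = one_hot.argmax(axis=1)
print(recovered)      # [5 0 4 3]
```

Indexing the identity matrix with an array of labels selects one row per label, so the result has the same [None, 10] layout as the output of the linear operation.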



wrong_predicted_images(pic_num=[3, 3], label_number=5)
[Figure: output of wrong_predicted_images(pic_num=[3, 3], label_number=5)]

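You can choose how many pictures to display at once. The displayed pictures are picked at random each time, and you can also choose which label class to inspect: with the call above set to 5, only misclassified pictures whose true label is 5 are shown, together with the wrong predictions. Finally, once the whole framework has finished computing, run the code below to close the tf.Session and free memory:

sess.close()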


