A plain-language explanation of the BERT model (Bidirectional Encoder Representations from Transformers), combining formulas with text
Transformer Block
Dot-product attention in BERT
Formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
Code:
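import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """
    Scaled Dot Product Attention
    """

    def forward(self, query, key, value, mask=None, dropout=None):
        # similarity scores, scaled by sqrt(d_k)
        scores = torch.matmul(query, key.transpose(-2, -1)) \
            / math.sqrt(query.size(-1))
        if mask is not None:
            # masked (padding) positions get a very large negative score
            scores = scores.masked_fill(mask == 0, -1e9)
        # softmax turns the scores into attention probabilities p_attn
        p_attn = F.softmax(scores, dim=-1)
        # if a dropout module is given, apply it to the attention probabilities
        if dropout is not None:
            p_attn = dropout(p_attn)
        return torch.matmul(p_attn, value), p_attn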
The Toutiao editor handles code poorly, so you may prefer to read this on my blog: https://www.shikanon.com/2019/機器學習/BERT代碼實現及解讀/
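Self-attention is usually computed in mini-batches, that is, several sentences are processed at once. The sentences differ in length, so the shorter ones are padded with zeros up to the longest length. Padding distorts the softmax, however: $e^0 = 1$, so a padded position still contributes a nonzero term to the normalization and changes the result. We do not want that, so the padded positions are masked during the computation: their scores are filled with a stand-in for negative infinity (a value such as -1e9), which drives their softmax weight to effectively 0 and hides them from the computation.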
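The effect of the mask can be checked with a few numbers. The following is a minimal sketch in plain Python rather than PyTorch; the scores and the mask are made-up values for one query attending over four key positions, the last two of which are assumed to be padding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(scores, mask, fill=-1e9):
    """Replace scores where mask == 0 with a large negative value, then softmax."""
    filled = [s if m != 0 else fill for s, m in zip(scores, mask)]
    return softmax(filled)


# Hypothetical scores for one query over 4 key positions; the last two are padding.
scores = [2.0, 1.0, 0.0, 0.0]
mask = [1, 1, 0, 0]

unmasked = softmax(scores)
masked = masked_softmax(scores, mask)
print("without mask:", [round(p, 4) for p in unmasked])
print("with mask:   ", [round(p, 4) for p in masked])
```

Without the mask the two padded positions still receive part of the probability mass; with the mask, all of it goes to the real tokens.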
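Multi-head self-attention
Formula:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
Attention Mask
Code: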
class MultiHeadedAttention(nn.Module):
    """
    Take in model size and number of heads.
    """

    def __init__(self, h, d_model, dropout=0.1):
        # h is the number of attention heads
        super().__init__()
        assert d_model % h == 0
        # d_k is the per-head key dimension; d_model is the model output
        # dimension and must be an integer multiple of h
        self.d_k = d_model // h
        self.h = h
        self.linear_layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
        self.output_linear = nn.Linear(d_model, d_model)
        self.attention = Attention()
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # project, then split into heads: [batch, seq, d_model] -> [batch, h, seq, d_k]
        query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linear_layers, (query, key, value))]
        # attention is applied to all heads in parallel
        x, attn = self.attention(query, key, value, mask=mask, dropout=self.dropout)
        # concatenate the heads again: [batch, h, seq, d_k] -> [batch, seq, h * d_k]
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.output_linear(x)
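The reshaping in `forward` is easier to follow on a toy example. The sketch below uses nested Python lists instead of tensors and leaves out the batch dimension; `split_heads` and `merge_heads` are hypothetical helper names that mimic what the `view(...).transpose(1, 2)` calls do for a single sequence.

```python
def split_heads(x, h):
    """[seq_len][d_model] -> [h][seq_len][d_k], mirroring view(...).transpose(1, 2)."""
    d_model = len(x[0])
    assert d_model % h == 0, "d_model must be a multiple of h"
    d_k = d_model // h
    return [[row[i * d_k:(i + 1) * d_k] for row in x] for i in range(h)]


def merge_heads(heads):
    """[h][seq_len][d_k] -> [seq_len][h * d_k], mirroring transpose(1, 2).view(...)."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]


# Toy input: seq_len = 2, d_model = 6, split into h = 3 heads of size d_k = 2.
x = [[0, 1, 2, 3, 4, 5],
     [10, 11, 12, 13, 14, 15]]
heads = split_heads(x, h=3)
print(heads[0])                # the first two features of every position
print(merge_heads(heads) == x)  # merging undoes the split
```

Each head sees a contiguous slice of the feature dimension, which is why `d_model` has to be divisible by `h`.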
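Position-wise FFN
The position-wise FFN is a two-layer neural network. The paper uses ReLU as the activation:
Formula:
$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
Note: the BERT implementation in Google's GitHub repository uses the Gaussian Error Linear Unit (GELU) instead of ReLU as the activation function.
Code: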
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = GELU()

    def forward(self, x):
        # expand to d_ff, apply GELU and dropout, then project back to d_model
        return self.w_2(self.dropout(self.activation(self.w_1(x))))


class GELU(nn.Module):
    """
    Gaussian Error Linear Unit.
    This is a smoother version of the RELU.
    Original paper: https://arxiv.org/abs/1606.08415
    """

    def forward(self, x):
        # tanh approximation of GELU
        return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
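To see what the tanh expression in `GELU` approximates, the sketch below compares it with the exact definition GELU(x) = x·Φ(x), where Φ is the standard normal CDF, and with ReLU. It is plain Python on scalars; the function names are chosen for this illustration only.

```python
import math


def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation, the same expression as in the GELU module."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def relu(x):
    return max(0.0, x)


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  "
          f"gelu_exact={gelu_exact(x):.4f}  gelu_tanh={gelu_tanh(x):.4f}")
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output, and the two GELU variants stay close to each other over this range.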
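Layer Normalization
LayerNorm applies layer normalization to the hidden layer: the inputs to all neurons of one layer are normalized together (along the channel direction), which speeds up training:
Layer normalization formula:
$$\mathrm{LN}(x) = a_2 \odot \frac{x - \mu}{\sigma + \epsilon} + b_2$$
where $\mu$ and $\sigma$ are the mean and standard deviation over the last (feature) dimension.
Code: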
class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        # learnable gain and bias, one value per feature
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        # mean(-1) averages over the last dimension, i.e. the innermost (feature) dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
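A worked example on one feature vector may help here. The sketch below repeats the arithmetic of `forward` in plain Python, with a scalar gain and bias standing in for `a_2` and `b_2` (in the module they are per-feature vectors). It divides by n - 1 because `x.std(...)` uses the unbiased estimate by default; the input values are made up.

```python
import math


def layer_norm(x, gain=1.0, bias=0.0, eps=1e-6):
    """Normalize one feature vector with the same arithmetic as LayerNorm.forward."""
    n = len(x)
    mean = sum(x) / n
    # torch.Tensor.std defaults to the unbiased estimate (divide by n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return [gain * (v - mean) / (std + eps) + bias for v in x]


features = [1.0, 2.0, 3.0, 4.0]  # hypothetical hidden vector of one token
normed = layer_norm(features)
print([round(v, 4) for v in normed])
```

The output is centered on zero and has roughly unit spread, whatever the scale of the input.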
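Residual connection
The residual connection is the Add+Norm layer in the figure. After each module, the value from before the module is added to the module's output, which gives the residual connection. The residual path lets gradients take a shortcut straight back to the earliest layers.
Residual connection formula:
y=f(x)+x
Here x is the input variable; the operation is simply a skip-layer addition.
Code: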
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        # Add and Norm: normalize first, run the sublayer, then add the input back
        return x + self.dropout(sublayer(self.norm(x)))
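The docstring notes that the norm is applied first rather than last. The sketch below contrasts the two orderings on a small vector in plain Python. Dropout is left out, `double` is a hypothetical stand-in for the attention or feed-forward sublayer, and `norm` is a simplified layer norm with unit gain and zero bias.

```python
import math


def norm(x, eps=1e-6):
    """Layer-normalize a list of floats (unit gain, zero bias)."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (len(x) - 1))
    return [(v - mean) / (std + eps) for v in x]


def add(a, b):
    return [u + v for u, v in zip(a, b)]


def pre_norm(x, sublayer):
    """x + sublayer(norm(x)): the order used by SublayerConnection."""
    return add(x, sublayer(norm(x)))


def post_norm(x, sublayer):
    """norm(x + sublayer(x)): add first, normalize last."""
    return norm(add(x, sublayer(x)))


def double(v):
    """Hypothetical sublayer standing in for attention or the feed-forward net."""
    return [2.0 * u for u in v]


x = [1.0, 2.0, 3.0]
pre = pre_norm(x, double)
post = post_norm(x, double)
print("pre-norm: ", [round(v, 4) for v in pre])
print("post-norm:", [round(v, 4) for v in post])
```

In the pre-norm form the input `x` reaches the output unchanged through the addition, so the skip path itself is never rescaled.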
Transformer Block
Code:
class TransformerBlock(nn.Module):
    """
    Bidirectional Encoder = Transformer (self-attention)
    Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
    """

    def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
        """
        :param hidden: hidden size of transformer
        :param attn_heads: number of heads in multi-head attention
        :param feed_forward_hidden: feed_forward_hidden, usually 4*hidden_size
        :param dropout: dropout rate
        """
        super().__init__()
        # multi-head attention
        self.attention = MultiHeadedAttention(h=attn_heads, d_model=hidden)
        # position-wise feed-forward network
        self.feed_forward = PositionwiseFeedForward(d_model=hidden, d_ff=feed_forward_hidden, dropout=dropout)
        # input sublayer: residual connection around attention
        self.input_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        # output sublayer: residual connection around the feed-forward network
        self.output_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, mask):
        # self-attention: query, key and value are all the same tensor
        x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask))
        x = self.output_sublayer(x, self.feed_forward)
        return self.dropout(x)
Embedding layer
The embedding is the sum of three embeddings (token, positional, and segment):
Code:
class BERTEmbedding(nn.Module):
    """
    BERT Embedding which consists of the following features
    1. TokenEmbedding : normal embedding matrix
    2. PositionalEmbedding : adding positional information using sin, cos
    3. SegmentEmbedding : adding sentence segment info, (sent_A:1, sent_B:2)
    The sum of all these features is the output of BERTEmbedding
    """

    def __init__(self, vocab_size, embed_size, dropout=0.1):
        """
        :param vocab_size: total vocab size
        :param embed_size: embedding size of token embedding
        :param dropout: dropout rate
        """
        super().__init__()
        self.token = TokenEmbedding(vocab_size=vocab_size, embed_size=embed_size)
        self.position = PositionalEmbedding(d_model=self.token.embedding_dim)
        self.segment = SegmentEmbedding(embed_size=self.token.embedding_dim)
        self.dropout = nn.Dropout(p=dropout)
        self.embed_size = embed_size

    def forward(self, sequence, segment_label):
        # the positional term has batch size 1 and broadcasts over the batch
        x = self.token(sequence) + self.position(sequence) + self.segment(segment_label)
        return self.dropout(x)
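Positional encoding (Positional Embedding)
The positional embedding has shape [max_len, d_model]: its width equals the word-vector dimension, and max_len is a hyperparameter that caps the length of a single sentence.
Formula:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
[Figure: plot of the positional encoding]
Code: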
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model).float()
        pe.requires_grad = False
        position = torch.arange(0, max_len).float().unsqueeze(1)
        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
        # even columns hold the sine, odd columns the cosine
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # add a leading batch dimension (dimension 0)
        pe = pe.unsqueeze(0)
        # register pe as a persistent buffer; it can be accessed as an attribute under the given name
        self.register_buffer('pe', pe)

    def forward(self, x):
        # return the encodings for the first x.size(1) positions
        return self.pe[:, :x.size(1)]
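For a concrete feel of the values in the `pe` buffer, the sketch below fills a small table in plain Python using the same frequency term as `div_term`. The sizes `max_len=4` and `d_model=4` are picked only to keep the printout short.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as div_term: exp(-i * ln(10000) / d_model)
            angle = pos * math.exp(-i * math.log(10000.0) / d_model)
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table


pe = positional_encoding(max_len=4, d_model=4)  # tiny sizes chosen for illustration
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

The first column pair oscillates quickly with the position while later pairs change much more slowly, so every position gets a distinct pattern.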
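Segment Embedding
The segment embedding tells the model which sentence or segment each token belongs to.
Three special tokens are used to structure the input: [CLS] at the start of the sequence, [SEP] at the end of each sentence, and [MASK] for a masked word.
For example: [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Code: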
class SegmentEmbedding(nn.Embedding):
    def __init__(self, embed_size=512):
        # 3 entries: index 0 is padding; 1 and 2 are sent_A and sent_B (see the BERTEmbedding docstring)
        super().__init__(3, embed_size, padding_idx=0)
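The segment labels fed to this embedding can be built as in the sketch below. It follows the label convention from the `BERTEmbedding` docstring (sentence A is 1, sentence B is 2) and assumes that 0 marks padding, in line with `padding_idx=0`. The helper name `build_segment_labels` and the `[PAD]` token are hypothetical, and assigning `[CLS]` and the first `[SEP]` to sentence A is an assumption of this sketch.

```python
def build_segment_labels(tokens_a, tokens_b, max_len):
    """Return (tokens, segment_labels), both padded to max_len.

    Labels: 1 for sentence A, 2 for sentence B, 0 (assumed) for padding.
    """
    part_a = ["[CLS]"] + tokens_a + ["[SEP]"]
    part_b = tokens_b + ["[SEP]"]
    tokens = part_a + part_b
    labels = [1] * len(part_a) + [2] * len(part_b)
    pad = max_len - len(tokens)
    assert pad >= 0, "sequence longer than max_len"
    return tokens + ["[PAD]"] * pad, labels + [0] * pad


tokens, labels = build_segment_labels(
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
    max_len=16,
)
for tok, lab in zip(tokens, labels):
    print(f"{tok:>8} -> {lab}")
```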
Token Embedding
Code:
class TokenEmbedding(nn.Embedding):
    def __init__(self, vocab_size, embed_size=512):
        super().__init__(vocab_size, embed_size, padding_idx=0)
BERT
class BERT(nn.Module):
    """
    BERT model : Bidirectional Encoder Representations from Transformers.
    """

    def __init__(self, vocab_size, hidden=768, n_layers=12, attn_heads=12, dropout=0.1):
        """
        :param vocab_size: vocabulary size (total number of tokens)
        :param hidden: hidden size of the BERT model
        :param n_layers: number of Transformer blocks (layers)
        :param attn_heads: number of attention heads
        :param dropout: dropout rate
        """
        super().__init__()
        self.hidden = hidden
        self.n_layers = n_layers
        self.attn_heads = attn_heads
        # paper noted they used 4*hidden_size for ff_network_hidden_size
        self.feed_forward_hidden = hidden * 4
        # embedding layer: positional + segment + token
        self.embedding = BERTEmbedding(vocab_size=vocab_size, embed_size=hidden)
        # stack of transformer blocks
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(hidden, attn_heads, hidden * 4, dropout) for _ in range(n_layers)])

    def forward(self, x, segment_info):
        # attention masking for padded tokens (token id 0)
        # mask shape: [batch_size, 1, seq_len, seq_len]
        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)
        # embedding the indexed sequence to sequence of vectors
        x = self.embedding(x, segment_info)
        # run through the stacked transformer blocks
        for transformer in self.transformer_blocks:
            x = transformer.forward(x, mask)
        return x
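The one-line mask construction in `forward` is compact, so here is a plain-Python sketch of what it produces for a single sequence (the batch and head dimensions are left out). The token ids are made up, and id 0 is treated as padding as in the code.

```python
def padding_mask(token_ids):
    """Build a seq_len x seq_len mask for one sequence; id 0 is treated as padding.

    Row i is the mask used when position i attends to the sequence, so every row
    is the same vector of key flags, like (x > 0) repeated seq_len times.
    """
    key_is_real = [1 if t > 0 else 0 for t in token_ids]
    return [list(key_is_real) for _ in token_ids]


ids = [5, 8, 3, 0, 0]  # hypothetical token ids, the last two are padding
mask = padding_mask(ids)
for row in mask:
    print(row)
```

Only the key side is masked: the rows that belong to padded query positions still attend to the real tokens.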
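Training tricks for the language model
How does BERT train itself? Below are the tricks that make self-supervised training possible:
Mask
Randomly mask or replace any character or word in a sentence, then let the model predict the masked or replaced part from the context. The loss is later computed only on the masked part.
Randomly select 15% of the tokens in a sentence and replace them as follows:
- 1) with 80% probability the token is replaced by [MASK];
- 2) with 10% probability it is replaced by some other token;
- 3) with 10% probability it is left unchanged.
The model predicts and restores the masked or replaced parts, and the loss function covers only those randomly masked or replaced positions.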
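The selection rule above can be sketched in plain Python as follows. The function name `mask_tokens`, the id `mask_id=4` for [MASK] and the vocabulary size are assumptions made for the example; label 0 is used for positions that need no prediction, in line with the `ignore_index=0` loss used later in the article.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_id, rng, select_prob=0.15):
    """Return (inputs, labels); labels are 0 where no prediction is required."""
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < select_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(mask_id)  # 80%: replace with [MASK]
            elif roll < 0.9:
                # 10%: random token (simplification: may equal the original)
                inputs.append(rng.randrange(1, vocab_size))
            else:
                inputs.append(tok)  # 10%: keep unchanged
        else:
            inputs.append(tok)
            labels.append(0)  # skipped by a loss with ignore_index=0
    return inputs, labels


rng = random.Random(0)  # fixed seed so the run is repeatable
ids = [rng.randrange(5, 100) for _ in range(20)]  # hypothetical token ids
inputs, labels = mask_tokens(ids, vocab_size=100, mask_id=4, rng=rng)
print(inputs)
print(labels)
```

Keeping some selected tokens unchanged and swapping others for random tokens means the model cannot rely on [MASK] alone to tell where a prediction is needed.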
Code:
class MaskedLanguageModel(nn.Module):
    """
    predicting the original token from the masked input sequence
    n-class classification problem, n-class = vocab_size
    """

    def __init__(self, hidden, vocab_size):
        """
        :param hidden: output size of BERT model
        :param vocab_size: total vocab size
        """
        super().__init__()
        self.linear = nn.Linear(hidden, vocab_size)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        # log-probabilities over the vocabulary for every position
        return self.softmax(self.linear(x))
Next sentence prediction
Code:
class NextSentencePrediction(nn.Module):
    """
    2-class classification model : is_next, is_not_next
    """

    def __init__(self, hidden):
        """
        :param hidden: BERT model output size
        """
        super().__init__()
        self.linear = nn.Linear(hidden, 2)
        # LogSoftmax is used instead of softmax because NLLLoss expects log-probabilities;
        # with plain softmax the gradient is also small when the output is far from the target
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        # classify from the first position of the sequence, i.e. the [CLS] token
        return self.softmax(self.linear(x[:, 0]))
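Loss function
Negative log-likelihood loss, also known as cross-entropy. Formula:
$$\mathcal{L} = -\sum_{c} y_c \log p_c$$
where $y$ is the one-hot target and $p$ is the predicted probability distribution.
Code: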
# In PyTorch, CrossEntropyLoss() equals LogSoftmax + NLLLoss, so with CrossEntropyLoss the last layer needs no softmax
# ignore_index=0 skips every target equal to 0 (the label of positions that were not masked)
criterion = nn.NLLLoss(ignore_index=0)
# 2-1. NLL(negative log likelihood) loss of is_next classification result
next_loss = criterion(next_sent_output, data["is_next"])
# 2-2. NLLLoss of predicting masked token word
mask_loss = criterion(mask_lm_output.transpose(1, 2), data["bert_label"])
# 2-3. Adding next_loss and mask_loss : 3.4 Pre-training Procedure
loss = next_loss + mask_loss
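The relation between log-softmax, NLL and the ignored label can be checked by hand. The sketch below is plain Python with made-up logits; `nll_loss` is a simplified stand-in for the mean-reduced `nn.NLLLoss`, and returning 0.0 when every target is ignored is a choice made for this sketch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_total for v in logits]


def nll_loss(log_probs, targets, ignore_index=0):
    """Mean negative log-likelihood over targets that differ from ignore_index."""
    kept = [-lp[t] for lp, t in zip(log_probs, targets) if t != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical logits for 3 positions over a 4-token vocabulary.
logits = [[2.0, 0.5, 0.1, 0.1], [0.2, 0.2, 3.0, 0.1], [1.0, 1.0, 1.0, 1.0]]
targets = [0, 2, 3]  # label 0 means "not masked", so position 0 is skipped

log_probs = [log_softmax(row) for row in logits]
loss = nll_loss(log_probs, targets)
print(round(loss, 4))
```

Applying the negative log-likelihood to log-softmax outputs is the cross-entropy of the softmax distribution, which is why the model heads end in `LogSoftmax` and the criterion is `NLLLoss`. Note that the same criterion is applied to the next-sentence labels, so if 0 is used there as a class label, those samples are skipped as well.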
Language model training
Code:
class BERTLM(nn.Module):
    """
    BERT Language Model
    Next Sentence Prediction Model + Masked Language Model
    """

    def __init__(self, bert: BERT, vocab_size):
        """
        :param bert: BERT model which should be trained
        :param vocab_size: total vocab size for masked_lm
        """
        super().__init__()
        self.bert = bert
        self.next_sentence = NextSentencePrediction(self.bert.hidden)
        self.mask_lm = MaskedLanguageModel(self.bert.hidden, vocab_size)

    def forward(self, x, segment_label):
        x = self.bert(x, segment_label)
        # both heads share the same encoder output
        return self.next_sentence(x), self.mask_lm(x)