Contents

  1. Former RNNs
  2. Transformers
  3. Multi-Head Attention
  4. Comparison of Computational Resources
  5. Block-Based Model
  6. Model-Specific Techniques
  7. Decoder
  8. Model Performance

The Transformer adopts the attention mechanism of Seq2Seq with attention, but without the division of work between a recurrent encoder and decoder. The concise model trains faster, and by stacking the queries into a matrix it boosts computational speed further, while still avoiding the long-term dependency problem and taking future inputs in the sequence into account.

1. Former RNNs

  • RNN: the information propagates through multiple mathematical transformations, causing the long-term dependency problem
  • Bi-directional RNN: the long-term dependency problem is still present, but each output can consider the words that come after the present time step
  • The Transformer solves even the long-term dependency problem

2. Transformer

  • Uses only the attention model from Seq2Seq as the main model
  • General Structure
    Image

a. Key Components

Image
Image Source

  • Query Vector: a vector $q_i = x_i W^Q$, transformed from the input vector $x_i$ of the corresponding time step via $W^Q$, used to calculate the weight of each value vector in producing the final output $z_i$.
  • Key Vector: vectors $k_j = x_j W^K$, transformed from all input vectors via the matrix product with $W^K$. The model calculates the weight for each value vector by taking the inner product of a query vector with the key vectors.
  • Value Vector: the input vectors from all time steps are transformed via $W^V$ to produce the value vectors $v_j = x_j W^V$. The weighted sum of these vectors is the output vector.
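
A minimal NumPy sketch of these three projections (the dimensions, variable names, and random weights here are illustrative, not taken from the original figure):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 5, 8, 8          # sequence length, input dim, projection dim (illustrative)
X = rng.normal(size=(n, d_model))  # one input vector per time step

# Trainable projection matrices (randomly initialized for the sketch)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q  # query vectors, one row per time step
K = X @ W_K  # key vectors
V = X @ W_V  # value vectors
```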

b. Detailed Process

  • The model trains the matrices $W^Q$, $W^K$, and $W^V$
  • We first calculate the weights $a_{ij} = \mathrm{softmax}_j(q_i \cdot k_j)$ for $j = 1, \dots, n$, where $n$ is the length of the entire sequence. The result is a probability distribution (or set of ratios) that adds up to 1
  • Multiplying each element of the weight vector with the corresponding value vector, we get the output vector
    • $z_i = \sum_{j=1}^{n} a_{ij} v_j$ (the output vector for the $i$-th time step)
  • With the entire process simplified, we get $z_i = \mathrm{softmax}(q_i K^\top)\,V$
  • We have multiple query vectors, so if we stack them into a matrix $Q$ (and likewise stack the keys and values into $K$ and $V$), the whole process is simplified as follows (see the sketch after this list)
    • $Z = \mathrm{softmax}(QK^\top)\,V$, or equivalently, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top)\,V$
  • Matrix multiplication is well optimized, making the self-attention model even faster
  • The model is a huge improvement over the previous RNNs, also because inputs from long ago are not subject to multiple transformations. All inputs have an equally short path to the output of any time step, solving the long-term dependency problem.
  • On top of that, the sequence can consider inputs that would normally appear later, allowing more contextual information to be channelled into the model.
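
A small NumPy sketch of the weighted sum described above; the random $Q$, $K$, $V$ stand in for the projections from the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 5, 8
Q = rng.normal(size=(n, d_k))  # stacked query vectors
K = rng.normal(size=(n, d_k))  # stacked key vectors
V = rng.normal(size=(n, d_k))  # stacked value vectors

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Per-query version: output vector for the i-th time step
i = 2
weights = softmax(Q[i] @ K.T)  # n weights that add up to 1
z_i = weights @ V              # weighted sum of the value vectors

# Matrix version: all outputs at once as softmax(Q K^T) V
Z = softmax(Q @ K.T, axis=-1) @ V
assert np.allclose(Z[i], z_i)
```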

c. Scaling the Dot Product

  • Method: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
  • Rationale: let’s assume the elements of $q$ and $k$ are mutually independent and follow a normal distribution with mean 0 and variance 1. The variance of $q \cdot k$ is then $d_k$, the dimension of the vectors. A huge variance is bad for training because the weights would vary wildly, meaning that the output would reflect only some of the value vectors. To prevent this, the model divides by $\sqrt{d_k}$, feeding the softmax function an input with mean 0 and variance 1.
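
A quick numerical check of this rationale (sample size and $d_k$ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, samples = 512, 10_000
q = rng.normal(size=(samples, d_k))  # elements ~ N(0, 1), mutually independent
k = rng.normal(size=(samples, d_k))

dots = (q * k).sum(axis=1)          # inner products q . k
print(dots.var())                   # ~ d_k = 512
print((dots / np.sqrt(d_k)).var())  # ~ 1 after scaling
```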

3. Multi-Head Attention

Imgur
Image Source

  • Need: each attention head specializes in encoding a different aspect of the input sequence. For example, one head could specialize in understanding the relationship between a noun and its modifiers.
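
A minimal NumPy sketch of multi-head attention (head count and dimensions are illustrative; a real implementation reshapes tensors instead of looping over heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

rng = np.random.default_rng(0)
n, d_model, h = 5, 64, 8           # h heads, each of dimension d_model // h
d_k = d_model // h
X = rng.normal(size=(n, d_model))

heads = []
for _ in range(h):
    # Each head has its own projections, so it can specialize differently
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

W_O = rng.normal(size=(h * d_k, d_model))
out = np.concatenate(heads, axis=-1) @ W_O  # concatenate heads, project back to d_model
```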

4. Comparison of Computational Resources

Imgur
Image Source

  • $n$ is the sequence length
  • $d$ is the dimension of the representation
  • $k$ is the kernel size (for convolution)
  • $r$ is the size of the neighborhood (for restricted self-attention)
  • The self-attention model requires a larger number of computations, but they are matrix multiplications, which parallelize well, so the total training time is shorter. In contrast, the recurrent model requires fewer computations, but they must be performed sequentially; the model is a series connection, making the total training time longer.
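
For reference, the per-layer costs behind this comparison (as reported in the original Transformer paper) are:

  • Self-Attention: $O(n^2 \cdot d)$ per layer, $O(1)$ sequential operations, $O(1)$ maximum path length
  • Recurrent: $O(n \cdot d^2)$ per layer, $O(n)$ sequential operations, $O(n)$ maximum path length
  • Convolutional: $O(k \cdot n \cdot d^2)$ per layer, $O(1)$ sequential operations, $O(\log_k n)$ maximum path length
  • Self-Attention (restricted): $O(r \cdot n \cdot d)$ per layer, $O(1)$ sequential operations, $O(n/r)$ maximum path length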

5. Block-Based Model

Imgur
Image Source

  • Residual connection: propagates $x$ to the point right after the multi-head attention layer, where it is added to the sublayer output. The sublayer is trained to learn the residual, $H(x) - x$, not the full mapping $H(x)$
  • Layer normalization
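
A minimal sketch of one encoder block in this post-norm form (the sublayers are passed in as callables; all names and sizes are my own placeholders, not from the original figure):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def encoder_block(x, self_attention, feed_forward):
    # Residual connection: each sublayer only has to learn H(x) - x
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x))
    return x

# Stand-in sublayers just to run the sketch (real ones: multi-head attention, 2-layer FFN)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = encoder_block(x,
                    self_attention=lambda t: t @ rng.normal(size=(8, 8)),
                    feed_forward=lambda t: np.maximum(0.0, t @ rng.normal(size=(8, 8))))
```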

6. Model-Specific Techniques

a. Normalization

Imgur
Image Source

  • Affine Transformation: $y = \gamma \hat{x} + \beta$. The parameters $\gamma$ and $\beta$ are all trainable.
  • Batch Norm: for each node, normalize the values collected across the whole batch, then apply the affine transform.
  • Layer Norm: normalize each word’s vector across its own feature dimension, then apply the affine transform to each sequence vector as below. Imgur
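
A sketch of the difference between the two normalizations on a (batch, sequence, feature) tensor; the axes and sizes here are illustrative assumptions:

```python
import numpy as np

def batch_norm(x, eps=1e-6):
    # Normalize each feature over the batch (and sequence) positions
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-6):
    # Normalize each word vector over its own feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(4, 5, 8))  # (batch, sequence, d_model)
gamma, beta = np.ones(8), np.zeros(8)                # trainable affine parameters
y = gamma * layer_norm(x) + beta                     # affine transform after normalizing
```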

b. Positional Encoding

  • Need: the transformer produces the same output wherever the words are placed, as long as the input words are the same. See the structure of the transformer to understand why. Imgur
    Image Source
  • Use sinusoidal functions: for example, sine functions for the even dimensions and cosine for the odd ones, with a different frequency for each dimension, and add them to the input vectors. See the sketch below.
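
A sketch of such a sinusoidal encoding, following the standard formulation with a base of 10000 (assumed here, not stated in the post):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)  # a different frequency per dimension pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # sine on the even dimensions
    pe[:, 1::2] = np.cos(angles)  # cosine on the odd dimensions
    return pe

X = np.random.default_rng(0).normal(size=(50, 512))  # 50 input vectors of dimension 512
X = X + positional_encoding(50, 512)                 # added to the input vectors
```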

c. Warm-up Learning Rate Scheduler

Imgur
Image Source

  • Offsets the large initial gradients with a small learning rate. The peak in the middle further pushes a model that could otherwise have settled for a local minimum.
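
A sketch of the warm-up schedule from the original paper (the defaults $d_{model} = 512$ and 4000 warm-up steps are the paper’s values, not necessarily what this figure used):

```python
def warmup_lr(step, d_model=512, warmup_steps=4000):
    # Linear warm-up, then inverse-square-root decay; peaks around step == warmup_steps
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print([round(warmup_lr(s), 6) for s in (1, 1000, 4000, 20000)])
```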

7. Decoder

When we need to perform tasks that require an additional sequence, such as translation, we use a model with a decoder.

a. Detailed Process

Image
Image Source

  • The query vectors come from the decoder, while the key and value vectors come from the output of the encoder, as sketched below
  • Masked decoder self-attention is used to produce the query vectors
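
A minimal sketch of this encoder-decoder attention step (all sizes and random values are placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_src, n_tgt, d_k = 7, 5, 8
enc_out = rng.normal(size=(n_src, d_k))     # encoder outputs
dec_hidden = rng.normal(size=(n_tgt, d_k))  # output of the masked decoder self-attention

W_Q, W_K, W_V = (rng.normal(size=(d_k, d_k)) for _ in range(3))
Q = dec_hidden @ W_Q  # queries come from the decoder
K = enc_out @ W_K     # keys come from the encoder output
V = enc_out @ W_V     # values come from the encoder output

Z = softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V  # (n_tgt, d_k)
```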

b. Masked Self-Attention

  • At the inference stage, the model does not have access to inputs from later in the sequence. To emulate such an environment during training, the model has to block access to inputs from later in the sequence.
  • This can be done by adequately masking the output of the softmax over $QK^\top$, as below. Image
    Image Source
  • The purple region marks the elements of the matrix whose values are set to 0.
  • Each row, expressed mathematically: $z_i = \sum_{j \le i} \frac{a_{ij}}{\sum_{k \le i} a_{ik}} v_j$, where $a_{ij}$ are the softmax attention weights.
  • Simply put, each row keeps only the weights for positions up to the current one and renormalizes them so they sum to 1 again.
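
A sketch of this masking, implemented as described (zero out future positions after the softmax and renormalize each row; adding $-\infty$ to the masked logits before the softmax gives the same result):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    n, d_k = Q.shape
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n, n) attention weights
    A = A * np.tril(np.ones((n, n)))              # zero out future positions (purple region)
    A = A / A.sum(axis=-1, keepdims=True)         # renormalize each row to sum to 1
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
Z = masked_self_attention(Q, K, V)
```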

8. Model Performance

Image

  • Reduces the training cost significantly
  • The BLEU score surpassed that of all SOTA models on the 2014 benchmark for the given task