What did I learn today
- Seq2Seq
- Takes a sequence of words as input and gives another sequence as an output
- Composed of Encoder and Decoder
- Decoder receives a 'start of sentence' token at the beginning and emits an 'end of sentence' token to end the sequence
- Problem? The entire source sequence has to be crammed into a single fixed-size hidden vector, even when the hidden dimension is relatively small, and information from the early part of the sequence fades away -> need for an attention model (a minimal sketch of the plain encoder-decoder follows)
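A minimal PyTorch-style sketch of the plain encoder-decoder (class names and dimensions are illustrative, not from any particular implementation):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the encoder compresses the whole source
    sequence into its final hidden state, which initializes the decoder."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt):
        # src: (batch, src_len); tgt: (batch, tgt_len), starting with the <sos> token
        _, h = self.encoder(self.src_emb(src))        # h is the only thing passed on
        dec_out, _ = self.decoder(self.tgt_emb(tgt), h)
        return self.out(dec_out)                      # (batch, tgt_len, tgt_vocab)
```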
- Seq2Seq Model with Attention
- Take the hidden vector from every encoder time step and the current decoder hidden vector; the inner product between the decoder vector and each encoder vector gives the attention scores
- Softmax the attention scores into weights over the encoder hidden vectors; the weighted sum of those vectors is the attention output
- The attention output concatenated with the decoder hidden state forms the final output vector y (see the sketch below)
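Roughly, the dot-product attention step described above (a sketch with assumed tensor shapes):

```python
import torch
import torch.nn.functional as F

def dot_attention(dec_hidden, enc_hiddens):
    """dec_hidden: (batch, hid); enc_hiddens: (batch, src_len, hid)."""
    # Inner product of the decoder state with every encoder state -> attention scores
    scores = torch.bmm(enc_hiddens, dec_hidden.unsqueeze(2)).squeeze(2)   # (batch, src_len)
    weights = F.softmax(scores, dim=1)                                    # attention distribution
    # Weighted sum of the encoder states -> attention output (context vector)
    context = torch.bmm(weights.unsqueeze(1), enc_hiddens).squeeze(1)     # (batch, hid)
    # Concatenated with the decoder state, this forms the final output vector y
    return torch.cat([context, dec_hidden], dim=1), weights
```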
- Teacher Forcing
- In the decoder, each time step feeds on the output from the previous time step
- During training, the model can either consume its own previous output or the ground-truth token instead; feeding the ground truth is called teacher forcing
- Teacher forcing makes training faster and more stable, but the model never practices recovering from its own mistakes, so it can be less accurate at inference (sketch below)
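A sketch of the choice a training loop makes at each decoder step (the function name and ratio are my own):

```python
import random

def next_decoder_input(ground_truth_token, predicted_token, teacher_forcing_ratio=0.5):
    """Pick the input for the next decoder step during training."""
    if random.random() < teacher_forcing_ratio:
        return ground_truth_token   # teacher forcing: faster, more stable training
    return predicted_token          # model consumes its own output, as it will at inference
```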
- Different Attention Mechanisms
- Luong:
- General: when computing the score, insert a learnable matrix between the encoder hidden state and the decoder hidden state
- Concat: concatenate the encoder and decoder hidden state vectors and feed them to a small neural network that produces the score (see the sketch below)
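How the Luong score variants might look in PyTorch (a sketch; the layer names and shapes are assumptions):

```python
import torch
import torch.nn as nn

class LuongScore(nn.Module):
    """Dot, general, and concat alignment scores between decoder and encoder states."""
    def __init__(self, hid_dim):
        super().__init__()
        self.W_general = nn.Linear(hid_dim, hid_dim, bias=False)      # 'general'
        self.W_concat = nn.Linear(2 * hid_dim, hid_dim, bias=False)   # 'concat'
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, dec_h, enc_h, mode="dot"):
        # dec_h: (batch, hid); enc_h: (batch, src_len, hid)
        if mode == "dot":       # plain inner product
            return torch.bmm(enc_h, dec_h.unsqueeze(2)).squeeze(2)
        if mode == "general":   # learnable matrix between the two hidden states
            return torch.bmm(enc_h, self.W_general(dec_h).unsqueeze(2)).squeeze(2)
        # 'concat': concatenate the states, pass through a small network, project to a scalar
        dec_expanded = dec_h.unsqueeze(1).expand_as(enc_h)             # (batch, src_len, hid)
        energy = torch.tanh(self.W_concat(torch.cat([enc_h, dec_expanded], dim=2)))
        return self.v(energy).squeeze(2)                               # (batch, src_len)
```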
- Attention is great!
- Improves neural machine translation performance. Enables focusing on particular parts of the source (which hidden state vector).
- Solves the bottleneck problem: earlier models had to rely solely on the encoder's final hidden state
- Because the gradient backpropagation path is shorter, attention also helps with the vanishing gradient problem
- Provides interpretability just by consulting the attention weights
- Beam Search
- Greedy decoding: pick the single most probable output token at each step. The problem is that the model cannot revise an earlier choice even if it later realizes it was wrong
- Exhaustive search: try all possible output sequences y, which gives V^t cases (vocabulary size V, sequence length t) -> too complex
- Beam Search
- A compromise between greedy decoding and exhaustive search: at each decoder time step, keep the k most probable partial translations (hypotheses)
- Scoring criterion: score a hypothesis by the sum of the log-probabilities of its tokens, score(y_1, ..., y_t) = Σ_i log P(y_i | y_1, ..., y_{i-1}, x)
- Keep the best k hypotheses at each step
- For each of the k hypotheses, take the top k most likely next words and compute scores, so the number of candidates temporarily becomes k^2
- Decode until the model produces the 'end of sentence' token
- If some hypotheses finish earlier than others, set them aside and keep expanding the rest
- Longer sequences accumulate smaller (more negative) log-probability scores, so normalize the score by the sequence length (see the sketch after this list)
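A self-contained beam search sketch over an assumed `next_log_probs(prefix)` callable that returns `{token: log_prob}` for the next step:

```python
def beam_search(next_log_probs, k=3, max_len=20, sos="<sos>", eos="<eos>"):
    beams = [([sos], 0.0)]                  # (partial hypothesis, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # top k next words per hypothesis -> temporarily k^2 candidates
            top = sorted(next_log_probs(seq).items(), key=lambda kv: kv[1], reverse=True)[:k]
            for tok, lp in top:
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:   # keep only the best k hypotheses
            if seq[-1] == eos:
                finished.append((seq, score))   # completed early: set aside
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)
    # normalize by length so longer hypotheses are not unfairly penalized
    return max(finished, key=lambda c: c[1] / len(c[0]))
```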
- BLEU Score
- Precision: of the words the model produces, how many are correct?
- Recall: of all the words in the reference, how many did the model recover?
- Get the F1-score (the harmonic mean of precision and recall)
- The problem -> these word-level metrics do not consider the order of the words in the output
- BLEU: compute n-gram (1 to 4) precision overlap and take the geometric mean
- Qualify the geometric mean with a brevity penalty: the ratio of candidate length to reference length, capped at 1, so overly short outputs are penalized (sketch below)
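A small sketch of sentence-level BLEU following the notes above (the simplified length-ratio brevity penalty is an assumption; the original BLEU paper uses an exponential form):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped 1- to 4-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())   # clipped overlap counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, len(candidate) / len(reference))               # capped at 1
    return bp * geo_mean

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the red mat".split()))   # ~0.68
```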