What did I learn today
- RNN
- $h_t = f_W(h_{t-1}, x_t)$, and an output is read off as $y_t = W_{hy} h_t$
- $h_{t-1}$: prior hidden-state vector
- $x_t$: input vector at time t
- $h_t$: current hidden-state vector
- $f_W$: RNN function
- $y_t$: output vector at time t
- When $y_t$ is actually produced depends on the task. Sentiment analysis? -> only at the last time step. (A minimal recurrence step is sketched below.)
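A minimal sketch of one recurrence step in plain numpy; the weight names `W_hh`, `W_xh`, `W_hy` and all the sizes are my own placeholders, not from any particular source:

```python
import numpy as np

hidden_size, input_size, output_size = 4, 3, 2

# Placeholder parameters (any initialization works for the sketch).
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # prior hidden -> hidden
W_xh = np.random.randn(hidden_size, input_size) * 0.1   # input -> hidden
W_hy = np.random.randn(output_size, hidden_size) * 0.1  # hidden -> output
b_h = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """h_t = f_W(h_{t-1}, x_t); y_t is read off from h_t."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_size)                  # initial hidden state
for x in np.random.randn(5, input_size):   # 5 dummy time steps
    h, y = rnn_step(h, x)                  # the same weights are reused at every step
```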
- Types of RNN
- One to Many: after the single input time step, feed zero vectors as the input for the remaining steps!
- Many to One: no output except at the final time step!
- Many to Many 1: e.g. after reading the sequence input, output the translation of the given sentence
- Many to Many 2: while receiving the input, output the result on the spot! (A small sketch of these cases follows this list.)
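A rough PyTorch sketch of how one recurrent layer serves Many to One vs. Many to Many 2; the layer sizes and the two linear heads are arbitrary placeholders:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
many_to_one_head = nn.Linear(16, 2)    # e.g. sentiment classes
many_to_many_head = nn.Linear(16, 5)   # e.g. one label per time step

x = torch.randn(2, 10, 8)              # (batch, time steps, features)
out, h_n = rnn(x)                      # out: (batch, time, hidden), one vector per step

# Many to One: only the last time step's hidden state feeds the output.
sentiment_logits = many_to_one_head(out[:, -1, :])   # (batch, 2)

# Many to Many 2: output "on the spot" at every time step.
per_step_logits = many_to_many_head(out)             # (batch, time, 5)
```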
- RNN Character-level Language Model
- Straightforward!
- The larger the hidden-layer dimension, the more information the network retains!
- Very time- and resource-consuming, because the network has to forward the entire sequence to get the loss and then differentiate to compute the gradient. -> break the sequence into smaller chunks and train on those (see the sketch below).
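That chunking idea is usually called truncated backpropagation through time. A hedged PyTorch sketch; the sizes, the dummy data, and `chunk_len` are all placeholders:

```python
import torch
import torch.nn as nn

vocab, hidden, chunk_len = 50, 64, 25
model = nn.RNN(input_size=vocab, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, vocab)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()))

seq = torch.randn(1, 1000, vocab)             # dummy stand-in for one-hot characters
targets = torch.randint(0, vocab, (1, 1000))  # dummy next-character targets
h = torch.zeros(1, 1, hidden)                 # (num_layers, batch, hidden)

for start in range(0, seq.size(1) - chunk_len + 1, chunk_len):
    x = seq[:, start:start + chunk_len]
    y = targets[:, start:start + chunk_len]

    h = h.detach()            # cut the graph: gradients stop at the chunk boundary
    out, h = model(x, h)
    loss = nn.functional.cross_entropy(head(out).reshape(-1, vocab), y.reshape(-1))

    opt.zero_grad()
    loss.backward()
    opt.step()
```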
- Why we don't use plain RNNs: the vanishing / exploding gradient problem.
- LSTM
- Solves the vanishing gradient problem!
- Overall Model
- $c_{t-1}$ (the upper inflow from the prior cell, the cell state) carries more information than the hidden-state vector.
- $h_{t-1}$ (the lower inflow from the prior cell, the hidden state) mainly carries information about the next layer's output and is used as an input for computing that output.
- Variables
- Sigmoid: takes values between 0 and 1. Decides what fraction of the original value to preserve.
- tanh: takes values between -1 and 1. Used when conveying information.
- Forget Gate
- $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
- the elements of $f_t$ are between 0 and 1. They decide the ratio of information retained from the prior cell state.
- Gate Gate
- $g_t = \tanh(W_g \cdot [h_{t-1}, x_t] + b_g)$, written into the cell state as $c_t = f_t \odot c_{t-1} + i_t \odot g_t$
- why is the input gate $i_t$ multiplied? To remove excess information!
- Output
- $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, $h_t = o_t \odot \tanh(c_t)$
- Use tanh to turn the cell state into information to present as the result
- Utilize only part of that information as the output
- $c_t$ has all the information, even what the model does not need at the present step.
- By adjusting $o_t$, the model can get an $h_t$ that holds only the information needed at the present step.
- Why use tanh? To give non-linearity!
- Why tanh in particular? To help against the vanishing gradient problem! (A full cell-step sketch follows below.)
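Putting the forget, input, gate, and output gates together as one cell step; this is a generic numpy sketch with made-up weight names and sizes, not the exact formulation from the figures:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*hidden, input+hidden), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: how much of c_{t-1} to keep
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: how much of g to write
    g = np.tanh(z[2 * hidden:3 * hidden])  # "gate gate": candidate information
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: what part of c_t to expose
    c_t = f * c_prev + i * g               # update by addition, not repeated matmul
    h_t = o * np.tanh(c_t)                 # only the needed part goes to the output
    return h_t, c_t

hidden, inp = 4, 3
W = np.random.randn(4 * hidden, inp + hidden) * 0.1
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in np.random.randn(6, inp):          # 6 dummy time steps
    h, c = lstm_step(x, h, c, W, b)
```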
- GRU
- $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$, $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
- Cell state and hidden state are combined into one.
- The larger the input (update) gate $z_t$, the more information is lost from the prior hidden state, and vice versa. (A cell sketch follows below.)
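A matching GRU-cell sketch in numpy, with the same caveat that the weight names and sizes are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; the hidden state also plays the role of the LSTM cell state."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(W_z @ xh)                    # input (update) gate
    r = sigmoid(W_r @ xh)                    # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde  # larger z -> more of h_prev replaced

hidden, inp = 4, 3
W_z = np.random.randn(hidden, inp + hidden) * 0.1
W_r = np.random.randn(hidden, inp + hidden) * 0.1
W_h = np.random.randn(hidden, inp + hidden) * 0.1
h = np.zeros(hidden)
for x in np.random.randn(6, inp):            # 6 dummy time steps
    h = gru_step(x, h, W_z, W_r, W_h)
```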
- Why don't GRU and LSTM suffer from the gradient vanishing & exploding problem?
- They update the state with addition, not repeated multiplication by the same weight matrix (see the note below).
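One short way to see the addition argument, using the LSTM cell-state update written above and looking only at the direct path through $c$:

$$\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)$$

So the gradient flowing along the cell state is scaled element-wise by the forget gate, which the model can learn to keep close to 1, instead of being multiplied over and over by the same recurrent weight matrix as in a vanilla RNN, whose repeated powers are what shrink or blow up the gradient.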