What did I learn today
- RNN
- $h_t = f_W(h_{t-1}, x_t)$, and an output is read off as $y_t = W_{hy} h_t$
- $h_{t-1}$: prior hidden-state vector
- $x_t$: input vector at time t
- $h_t$: current hidden-state vector
- $f_W$: RNN function
- $y_t$: output vector at time t
- When $y_t$ is actually produced depends on the task. Sentiment analysis? -> only at the last time step. (A minimal recurrence step is sketched below.)
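A minimal sketch of one recurrence step in plain numpy; the weight names `W_hh`, `W_xh`, `W_hy` and all the sizes are my own placeholders, not from any particular source:

```python
import numpy as np

hidden_size, input_size, output_size = 4, 3, 2

# Placeholder parameters (any initialization works for the sketch).
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # prior hidden -> hidden
W_xh = np.random.randn(hidden_size, input_size) * 0.1   # input -> hidden
W_hy = np.random.randn(output_size, hidden_size) * 0.1  # hidden -> output
b_h = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """h_t = f_W(h_{t-1}, x_t); y_t is read off from h_t."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_size)                  # initial hidden state
for x in np.random.randn(5, input_size):   # 5 dummy time steps
    h, y = rnn_step(h, x)                  # the same weights are reused at every step
```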
- Types of RNN
- One to Many: after the single input time step, feed zero vectors as the input for the remaining steps!
- Many to One: no output except at the final time step!
- Many to Many 1: e.g. after reading the sequence input, output the translation of the given sentence
- Many to Many 2: while receiving the input, output the result on the spot! (A small sketch of these cases follows this list.)
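A rough PyTorch sketch of how one recurrent layer serves Many to One vs. Many to Many 2; the layer sizes and the two linear heads are arbitrary placeholders:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
many_to_one_head = nn.Linear(16, 2)    # e.g. sentiment classes
many_to_many_head = nn.Linear(16, 5)   # e.g. one label per time step

x = torch.randn(2, 10, 8)              # (batch, time steps, features)
out, h_n = rnn(x)                      # out: (batch, time, hidden), one vector per step

# Many to One: only the last time step's hidden state feeds the output.
sentiment_logits = many_to_one_head(out[:, -1, :])   # (batch, 2)

# Many to Many 2: output "on the spot" at every time step.
per_step_logits = many_to_many_head(out)             # (batch, time, 5)
```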
- RNN Character-level Language Model
- Straightforward!
- The larger the hidden-layer dimension, the more information the network retains!
- Very time- and resource-consuming, because the network has to forward the entire sequence to get the loss and then differentiate to compute the gradient. -> break the sequence into smaller chunks and train on those (see the sketch below).
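That chunking idea is usually called truncated backpropagation through time. A hedged PyTorch sketch; the sizes, the dummy data, and `chunk_len` are all placeholders:

```python
import torch
import torch.nn as nn

vocab, hidden, chunk_len = 50, 64, 25
model = nn.RNN(input_size=vocab, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, vocab)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()))

seq = torch.randn(1, 1000, vocab)             # dummy stand-in for one-hot characters
targets = torch.randint(0, vocab, (1, 1000))  # dummy next-character targets
h = torch.zeros(1, 1, hidden)                 # (num_layers, batch, hidden)

for start in range(0, seq.size(1) - chunk_len + 1, chunk_len):
    x = seq[:, start:start + chunk_len]
    y = targets[:, start:start + chunk_len]

    h = h.detach()            # cut the graph: gradients stop at the chunk boundary
    out, h = model(x, h)
    loss = nn.functional.cross_entropy(head(out).reshape(-1, vocab), y.reshape(-1))

    opt.zero_grad()
    loss.backward()
    opt.step()
```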
- Why we don't use plain RNNs: the vanishing / exploding gradient problem.
- LSTM
- Solves the vanishing gradient problem!
- Overall Model
- $c_{t-1}$ (the upper inflow from the prior cell, the cell state) carries more information than the hidden-state vector.
- $h_{t-1}$ (the lower inflow from the prior cell, the hidden state) mainly carries information about the next layer's output and is used as an input for computing that output.
- Variables
- Sigmoid: takes values between 0 and 1. Decides what fraction of the original value to preserve.
- tanh: takes values between -1 and 1. Used when conveying information.
- Forget Gate
- $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
- the elements of $f_t$ are between 0 and 1. They decide the ratio of information retained from the prior cell state.
- Gate Gate
- $g_t = \tanh(W_g \cdot [h_{t-1}, x_t] + b_g)$, written into the cell state as $c_t = f_t \odot c_{t-1} + i_t \odot g_t$
- why is the input gate $i_t$ multiplied? To remove excess information!
- Output
- $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, $h_t = o_t \odot \tanh(c_t)$
- Use tanh to turn the cell state into information to present as the result
- Utilize only part of that information as the output
- $c_t$ has all the information, even what the model does not need at the present step.
- By adjusting $o_t$, the model can get an $h_t$ that holds only the information needed at the present step.
- Why use tanh? To give non-linearity!
- Why tanh in particular? To help against the vanishing gradient problem! (A full cell-step sketch follows below.)
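Putting the forget, input, gate, and output gates together as one cell step; this is a generic numpy sketch with made-up weight names and sizes, not the exact formulation from the figures:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*hidden, input+hidden), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: how much of c_{t-1} to keep
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: how much of g to write
    g = np.tanh(z[2 * hidden:3 * hidden])  # "gate gate": candidate information
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: what part of c_t to expose
    c_t = f * c_prev + i * g               # update by addition, not repeated matmul
    h_t = o * np.tanh(c_t)                 # only the needed part goes to the output
    return h_t, c_t

hidden, inp = 4, 3
W = np.random.randn(4 * hidden, inp + hidden) * 0.1
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in np.random.randn(6, inp):          # 6 dummy time steps
    h, c = lstm_step(x, h, c, W, b)
```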
- GRU
- $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$, $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
- Cell state and hidden state are combined into one.
- The larger the input (update) gate $z_t$, the more information is lost from the prior hidden state, and vice versa. (A cell sketch follows below.)
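A matching GRU-cell sketch in numpy, with the same caveat that the weight names and sizes are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; the hidden state also plays the role of the LSTM cell state."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(W_z @ xh)                    # input (update) gate
    r = sigmoid(W_r @ xh)                    # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde  # larger z -> more of h_prev replaced

hidden, inp = 4, 3
W_z = np.random.randn(hidden, inp + hidden) * 0.1
W_r = np.random.randn(hidden, inp + hidden) * 0.1
W_h = np.random.randn(hidden, inp + hidden) * 0.1
h = np.zeros(hidden)
for x in np.random.randn(6, inp):            # 6 dummy time steps
    h = gru_step(x, h, W_z, W_r, W_h)
```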
- Why don't GRU and LSTM suffer from the gradient vanishing & exploding problem?
- They update the state with addition, not repeated multiplication by the same weight matrix (see the note below).
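One short way to see the addition argument, using the LSTM cell-state update written above and looking only at the direct path through $c$:

$$\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)$$

So the gradient flowing along the cell state is scaled element-wise by the forget gate, which the model can learn to keep close to 1, instead of being multiplied over and over by the same recurrent weight matrix as in a vanilla RNN, whose repeated powers are what shrink or blow up the gradient.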