MRC Summary

1. What is MRC

a. MRC tasks

Extractive Answer Task: finding answer from a designated document
Descriptive Answer Task: finding answer without a designated document
Multiple-choice Datasets: selecting answer from multiple choices

b. Challenges

Multi-hop reasoning
Unanswerable questions

c. Evalutation Metric

Exact Match
F1 Score: frequency of token overlap

2. Unicode & Tokenization

a. Unicode

An uniform way of processing characters
ex) U+AC00
‘U+’: means unicode
‘AC00’: hexadecimal code point

b. Encoding

Formatting unicodes into binary numbers
UTF-8, UTF-16, UTF-32
The number indicate the number of bytes allocated for the represenation of each character

c. Tokenization

Subword tokenization: breaking down unfrequented words and using the whole word when appears more frequently
Byte-Pair Encoding: replace the most frequented word with a different character. Repeat it multiple times

3. MRC Types:

a. Selection Based

Extract the answer from the document by slicing
Model: classifier
Loss: location of the answer

b. Generation Based

Model: Seq-to-Seq PLM
Loss: compute F1 by comparing
Bart: Bidirectional Encoder + Autoregressive (Uni-directional) Decoder
- Bart is pretrained to reconstruct a text with noise such as sentence permutation and deleted or masked tokens

Preprocessing for Generative MRC:

Use natural language to replace the speical tokens
Most generative models like Bart do not receive token_types_ids because speicifying it proved futile. Only input_ids and attention_mask

4. Postprocessing

a. For Selection Based MRC

Find the highest position predictions for start and end position respectively
Remove invalid composition
Descendingly sort the composition according to the score
Get Top-K
b. For Generation Based MRC
Decoding: the output of a prior step is the input of the current step
Since the model can generate multiple top-k tokens for each stage, several tatics to adopt
- Greedy Search: search only the option with the highest probability for each token output
- Exhaustive Search: search all top-k token outputs and its branches
- Beam Search: maintain only the top-k candidates. Eliminate the rest

5. Passage Retrieval

Finding the right passage for a given query
MRC could be broken down into two structures. A Retrieval System and a Answering system
Use query embedding and passage embedding to yield similiarity score for each pair
a. Passage Embedding
Sparse Embedding: BoW + TF-IDF. Each dimension corresponds to a specific word
- Limits of Sparse Embedding: the dimension is too big, similarity of the tokens are not taken into consideration, and no Further training is possible
Dense Embedding:
- Upsides of Dense Embedding: less dimension of a embedding vector. The embedding is similar if the meanings are close
- Use the <CLS> tokens NLP models like BERT to create an embedding and apply dot product between the query embedding and the passage embedding.
- Better Dense Encoding?
  - Larger DPR
  - Larger Pretrained Model
  - Better data
  - Choosing harder negative samples!
    - Train the enocder with passages that have high TF-IDF score but is not an answer

Share on

Twitter Facebook LinkedIn

MRC Summary

1. What is MRC

a. MRC tasks

b. Challenges

c. Evalutation Metric

2. Unicode & Tokenization

a. Unicode

b. Encoding

c. Tokenization

3. MRC Types:

a. Selection Based

b. Generation Based

Preprocessing for Generative MRC:

4. Postprocessing

a. For Selection Based MRC

b. For Generation Based MRC

5. Passage Retrieval

a. Passage Embedding

Share on

You may also enjoy

Source of Bias in MRC

Advanced Topis in MRC

Elastic Search Summary

부스트캠프 11주차 주간학습 정리