1. Mitigating Training Bias

  • Bias is widespread in training data; models can pick it up, e.g., gender bias

a. Training Bias in the Reader Model

  • Reader models are typically trained on only a limited, fixed set of documents
  • If the reader model is given text on a relatively new subject, it will have a hard time producing the answer
  • Solutions (sketches for both follow this list):
    • For the reader model: by enabling the model to output a “no answer” prediction, we can train it on larger sets of documents, including ones that contain no answer
    • For the retriever model: train the DPR with hard negatives, i.e., negative samples that have high BM25 scores but do not contain the answer
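
A minimal sketch of the “no answer” idea for the reader, assuming a SQuAD 2.0-style extractive setup with Hugging Face transformers (the bert-base-uncased checkpoint is just a placeholder): unanswerable examples are labeled by pointing both span endpoints at the [CLS] token, so the reader can also be trained on passages that contain no answer.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

# A passage that does NOT contain the answer to the question.
question = "Who wrote Hamlet?"
context = "Madrid is the capital and most populous city of Spain."
inputs = tokenizer(question, context, return_tensors="pt")

# SQuAD 2.0 convention: an unanswerable example is labeled by pointing
# both span endpoints at the [CLS] token (index 0), the "no answer" slot.
outputs = model(
    **inputs,
    start_positions=torch.tensor([0]),
    end_positions=torch.tensor([0]),
)
outputs.loss.backward()  # gradient for one no-answer training example
```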
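And a minimal sketch of mining BM25 hard negatives for DPR training, assuming the rank_bm25 package (the tiny corpus and helper names are illustrative): pick passages that BM25 ranks highly for the question but that do not contain the gold answer.

```python
import re
from rank_bm25 import BM25Okapi  # one of several BM25 implementations

passages = [
    "Hamlet is a tragedy written by William Shakespeare around 1600.",
    "Many critics wrote essays about the themes of Hamlet.",
    "Madrid is the capital and most populous city of Spain.",
]

def tokenize(text):
    return re.findall(r"\w+", text.lower())

bm25 = BM25Okapi([tokenize(p) for p in passages])

def mine_hard_negative(question, answer, top_k=50):
    """Return a passage BM25 ranks highly that does not contain the answer."""
    scores = bm25.get_scores(tokenize(question))
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    for i in ranked[:top_k]:
        if answer.lower() not in passages[i].lower():
            return passages[i]  # lexically close to the question, yet a true negative
    return None

print(mine_hard_negative("Who wrote Hamlet?", "William Shakespeare"))
# -> "Many critics wrote essays about the themes of Hamlet."
```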

2. Annotation Bias from Datasets

  • If the dataset creator already knows the answer to a question, the question can contain too many hints, making the retrieval task artificially easy
  • To avoid this bias, we can adopt natural questions that real users have actually posed on the web (the approach behind the Natural Questions dataset)
  • The ground truth may not include all valid answers, penalizing a model that actually produced a correct one; the sketch below shows how matching against a set of answer aliases softens this
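
A minimal sketch of one common mitigation on the evaluation side, using SQuAD-style answer normalization (the helper names are my own): score a prediction as correct if it matches any of a set of gold aliases, which softens, but does not eliminate, the incomplete-ground-truth penalty.

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation, articles, extra spaces."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    # Comparing against a *set* of gold aliases softens the penalty when the
    # annotation lists only some valid phrasings; it cannot recover answers
    # that are missing from the set entirely.
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The United States", ["USA", "United States"]))  # True
```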