1. Types of Retrieval
- Boolean Retrieval: retrieves a document if it includes specified word
- Rank Retrieval: specify a weight for terms. The weight for words differs depending whether the term came from the context or query. The weighting strategy for the terms can vary.
- TF-IDF: compute weight of terms by considering the frequency of words (TF) and if the word advents only in a few documents give high weigt (IDF).
- BM25: a modified version of TF-IDF. Can do parameter tuning.
2. Elasticsearch
- Elastic search indexes each document by words, decreasing the time complexity from O(n) to O(1); index of all documents that include a word is saved. This method is called Inverted Index.
a. Analyzer
- Before processing the query and returning the matching value, a function called analyzer in Easlticsearch preprocesses the input.
- Flow: Char Filters - Tokenizer - Token Filters
- Char Filters: filters characters such as html tokens before tokenizing
- HTML Strip character filter
- Mapping character filter
- Pattern replace character filter
- Tokenizer: breaks down the words to smaller units of meaning. Can select from multiple choices
- Word Oriented Tokenizer: standard, letter, whitespace
- Partial Word Tokenizer: N-gram
- Structured Text Tokenizer: Keyword
- Token Filter: removing tokenized units
- remove words like ‘the’, which does not have meaning
- put words that has different form but is basically the same word
- includes filters such as stemmer, n-gram, stop-words, shingle, uppercase, lowercase
b. Settings
- Scoring: computes score to return documents that best match the query
- Default is BM25, but can use DFR, DFI, IB, LM Dirichlet