Elasticsearch Summary

1. Types of Retrieval

Boolean Retrieval: retrieves a document if it contains the specified word(s).
Ranked Retrieval: assigns a weight to each term and ranks documents by the resulting score. A term's weight can differ depending on whether the term comes from the document (context) or the query, and the weighting strategy can vary.
TF-IDF: weights a term by how often it occurs in a document (TF) and gives a higher weight to terms that appear in only a few documents (IDF).
BM25: a modified version of TF-IDF that saturates term frequency and normalizes for document length. Its parameters (k1 and b) can be tuned.
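
To make the difference between TF-IDF and BM25 concrete, here is a minimal Python sketch. It is an illustration only, not Elasticsearch's actual implementation: the function names, the whitespace tokenizer, the toy documents, and the smoothed IDF formula are assumptions chosen for the example. The values k1=1.2 and b=0.75 are common defaults.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase + whitespace split, for illustration only.
    return text.lower().split()


def idf(term, docs):
    # Smoothed inverse document frequency: terms found in few documents
    # receive a higher weight than terms found in most documents.
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


def tf_idf(term, doc, docs):
    # Raw term frequency multiplied by IDF.
    return Counter(doc)[term] * idf(term, docs)


def bm25(query, doc, docs, k1=1.2, b=0.75):
    # k1 controls term-frequency saturation, b controls length normalization.
    avgdl = sum(len(d) for d in docs) / len(docs)
    counts = Counter(doc)
    score = 0.0
    for term in query:
        tf = counts[term]
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf(term, docs) * norm
    return score


docs = [tokenize(t) for t in [
    "the quick brown fox",
    "the lazy dog",
    "the quick quick rabbit",
]]

print(tf_idf("the", docs[0], docs), tf_idf("fox", docs[0], docs))
print(bm25(["quick"], docs[2], docs), bm25(["quick"], docs[0], docs))
```

In this toy corpus "the" appears in every document, so it scores lower than "fox", and the document that repeats "quick" scores higher for that query than the one that contains it once.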

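Elasticsearch indexes each document by its words: for every word, it stores the list of all documents that contain that word. This structure is called an inverted index. Finding the documents for a term then no longer requires scanning all n documents (O(n)); it becomes a lookup that is roughly O(1) with respect to the number of documents.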
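
The idea can be sketched in a few lines of Python. This is a simplified illustration, not how Elasticsearch stores its index internally; the function names and sample documents are made up for the example. It also shows Boolean retrieval as a set intersection over the stored document lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    # Map each term to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    # Boolean retrieval: keep only documents that contain every term.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "elasticsearch builds an inverted index",
    2: "an index maps terms to documents",
    3: "ranked retrieval scores documents",
}
index = build_inverted_index(docs)

print(index["index"])                              # documents containing "index"
print(boolean_and(index, ["index", "documents"]))  # documents containing both
```

Looking up a term is a single dictionary access, regardless of how many documents were indexed.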

Before a query is processed and matching documents are returned, a component in Elasticsearch called an analyzer preprocesses the input text. The same kind of analysis is applied to documents when they are indexed.
Flow: Char Filters - Tokenizer - Token Filters
Char Filters: filter or replace characters, such as HTML tags, before tokenizing
- HTML Strip character filter
- Mapping character filter
- Pattern replace character filter
Tokenizer: breaks the text into smaller units (tokens). Several tokenizers are available:
- Word Oriented Tokenizer: standard, letter, whitespace
- Partial Word Tokenizer: N-gram
- Structured Text Tokenizer: Keyword
Token Filter: modifies, removes, or adds tokens after tokenizing
- removes words like ‘the’ that carry little meaning (stop words)
- maps words that have different forms but are basically the same word to a common form (stemming)
- includes filters such as stemmer, n-gram, stop words, shingle, uppercase, lowercase
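
The Char Filters - Tokenizer - Token Filters flow can be mimicked in plain Python as a rough sketch. None of this is Elasticsearch code: the regex-based HTML stripping, the stop-word list, and the naive suffix-stripping "stemmer" are simplifying assumptions that only show the order of the stages.

```python
import re

# Assumption: a tiny illustrative stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "of"}


def html_strip(text):
    # Char filter: crude removal of HTML tags before tokenizing.
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    # Tokenizer: split the text on whitespace.
    return text.split()


def lowercase_filter(tokens):
    # Token filter: normalize case.
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    # Token filter: drop stop words.
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stemmer(tokens):
    # Token filter: toy stemmer that only strips a trailing "s".
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def analyze(text):
    # Run the stages in order: char filter, tokenizer, token filters.
    text = html_strip(text)
    tokens = whitespace_tokenizer(text)
    for token_filter in (lowercase_filter, stop_filter, naive_stemmer):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>The</b> Quick Dogs"))  # ['quick', 'dog']
```

The order matters: lowercasing runs before the stop filter so that "The" matches the lowercase stop word "the".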

Scoring: computes a score so that the documents that best match the query are returned first
- Default is BM25, but DFR, DFI, IB, and LM Dirichlet can also be used
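
As a sketch of how the scoring function might be configured, the snippet below builds an index-creation request body as a Python dict and prints it as JSON. The similarity name "tuned_bm25" and the field name "body" are hypothetical, and the k1 and b values are arbitrary examples. The dict would be sent with whatever client is in use; no client call is shown here.

```python
import json

# Assumption: "tuned_bm25" and "body" are made-up names for this example.
index_body = {
    "settings": {
        "index": {
            "similarity": {
                "tuned_bm25": {
                    "type": "BM25",
                    "k1": 1.5,   # term-frequency saturation
                    "b": 0.6,    # document-length normalization
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # The text field refers to the custom similarity by name.
            "body": {"type": "text", "similarity": "tuned_bm25"}
        }
    },
}

print(json.dumps(index_body, indent=2))
```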