I'm trying to understand how CutoffFrequency works,
and I found that the frequency calculated by Elasticsearch is not what I assumed.
Here's my environment:
-
I built my index with
1 shard, 0 replicas
-
My query, written in Go:
elastic.NewCommonTermsQuery("title", "batman official trailers").Analyzer("synonym").CutoffFrequency(0.006).LowFreqMinimumShouldMatch(4)
-
synonym:
- expand
batman into bruce
- expand
-
target doc:
{"title": "Batman Batman Batman Batman Batman bruce Batman official trailers"}
-
Statistics from requesting /_termvectors:
{
  "_index": "posts",
  "_type": "posts",
  "_id": "227824665",
  "_version": 6,
  "found": true,
  "took": 38,
  "term_vectors": {
    "title": {
      "field_statistics": { "sum_doc_freq": 635, "doc_count": 155, "sum_ttf": 641 },
      "terms": {
        "batman":   { "doc_freq": 2, "ttf": 12, "term_freq": 6, "tokens": [...] },
        "official": { "doc_freq": 2, "ttf": 2,  "term_freq": 1, "tokens": [...] },
        "bruce":    { "doc_freq": 2, "ttf": 2,  "term_freq": 1, "tokens": [...] },
        "trailers": { "doc_freq": 2, "ttf": 2,  "term_freq": 1, "tokens": [...] }
      }
    }
  }
}
-
The frequencies I assumed (ttf / sum_ttf):
- batman: 12 / 641 = 0.018
- official: 2 / 641 = 0.003
- bruce: 2 / 641 = 0.003
- trailers: 2 / 641 = 0.003
But it turns out that CutoffFrequency(0.006)
does not retrieve the target doc, while CutoffFrequency(0.007)
does retrieve it.
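To make my assumption concrete, here is a minimal sketch of the calculation I had in mind. It only reproduces the ttf / sum_ttf arithmetic from the term vectors above; the names termStat and assumedFrequency are mine for illustration, and this is not meant to represent what Elasticsearch actually does internally.

```go
package main

import "fmt"

// termStat holds a term and its total term frequency (ttf)
// as reported by /_termvectors. Hypothetical helper type.
type termStat struct {
	name string
	ttf  int
}

// assumedFrequency returns ttf / sum_ttf, the ratio assumed in this question.
// This is my guess, not a confirmed Elasticsearch formula.
func assumedFrequency(ttf, sumTTF int) float64 {
	return float64(ttf) / float64(sumTTF)
}

func main() {
	const sumTTF = 641 // field_statistics.sum_ttf for "title"
	terms := []termStat{
		{"batman", 12},
		{"official", 2},
		{"bruce", 2},
		{"trailers", 2},
	}
	// Classify each term against both cutoffs that were tried.
	for _, cutoff := range []float64{0.006, 0.007} {
		for _, t := range terms {
			f := assumedFrequency(t.ttf, sumTTF)
			fmt.Printf("cutoff=%.3f %-8s freq=%.4f high=%t\n", cutoff, t.name, f, f > cutoff)
		}
	}
}
```

Under this assumption, batman is above both cutoffs and the other three terms are below both, so 0.006 and 0.007 should classify every term the same way. That is why the difference in results surprises me.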
Does anyone know how Elasticsearch calculates the term frequency used for the cutoff?