How does Elasticsearch calculate term frequency when using CutoffFrequency in CommonTermsQuery?


(張泰瑋(Chang Tai Wei)) #1

I'm trying to understand how CutoffFrequency works, and I found that the frequency Elasticsearch calculates is not what I assumed.

Here's my environment:

  • I built my index with 1 shard and 0 replicas

  • my query, written in Go: elastic.NewCommonTermsQuery("title", "batman official trailers").Analyzer("synonym").CutoffFrequency(0.006).LowFreqMinimumShouldMatch(4)

  • synonym:

    • expand batman into bruce
  • target doc:

    • {"title": "Batman Batman Batman Batman Batman bruce Batman official trailers"}
  • statistics from requesting /_termvectors:

    {
      "_index": "posts",
      "_type": "posts",
      "_id": "227824665",
      "_version": 6,
      "found": true,
      "took": 38,
      "term_vectors": {
        "title": {
          "field_statistics": {
            "sum_doc_freq": 635,
            "doc_count": 155,
            "sum_ttf": 641
          },
          "terms": {
            "batman": {
              "doc_freq": 2,
              "ttf": 12,
              "term_freq": 6,
              "tokens": [...]
            },
            "official": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1,
              "tokens": [...]
            },
            "bruce": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1,
              "tokens": [...]
            },
            "trailers": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1,
              "tokens": [...]
            }
          }
        }
      }
    }
    
  • the frequencies I assumed (ttf / sum_ttf):

    • batman: 12 / 641 = 0.018
    • official: 2/641 = 0.003
    • bruce: 2/641 = 0.003
    • trailers: 2/641 = 0.003

But it turns out that CutoffFrequency(0.006) does not retrieve the target doc, while CutoffFrequency(0.007) does.
Does anyone know how Elasticsearch calculates term frequency?


(張泰瑋(Chang Tai Wei)) #2

OK, I figured it out myself.

The frequency here is the document frequency of a term, i.e. doc_freq divided by doc_count rather than ttf divided by sum_ttf.

e.g.:

  • batman: 2/155
  • official: 2/155
  • bruce: 2/155
  • trailers: 2/155
