Inconsistent sum_doc_freq and sum_ttf numbers in _mtermvectors

yluo · April 19, 2018, 5:25pm

Hi there,

We have been using Elasticsearch and ingest-attachment to index program files for full-text search and keyword extraction (filtered by TFxIDF), using metadata from _mtermvectors. However, we recently found an inconsistency in the _mtermvectors API.

Suppose we indexed 46 program documents. Calling _mtermvectors then returns field_statistics like this:
"field_statistics": {
  "sum_doc_freq": 8715,
  "doc_count": 46,
  "sum_ttf": 27672
}

I then modified only one document, say a.java, by adding a single space character, re-indexed it, and called _mtermvectors again. The field_statistics changed to:
"field_statistics": {
  "sum_doc_freq": 8995,
  "doc_count": 47,
  "sum_ttf": 30346
}
From the above result, doc_count became 47 even though we only have 46 documents, and sum_doc_freq and sum_ttf both increased. It seems the terms of a.java were counted twice, although the only change was one added space.
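The size of the jump can be checked with a little arithmetic. The sketch below only subtracts the two field_statistics blocks quoted above; reading the difference as "exactly one extra copy of a.java" is an assumption, not something the API reports.

```python
# Field statistics copied from the two _mtermvectors responses above.
before = {"sum_doc_freq": 8715, "doc_count": 46, "sum_ttf": 27672}
after = {"sum_doc_freq": 8995, "doc_count": 47, "sum_ttf": 30346}

# Per-statistic difference: what the extra (stale) copy appears to contribute.
delta = {key: after[key] - before[key] for key in before}

print(delta)
# {'sum_doc_freq': 280, 'doc_count': 1, 'sum_ttf': 2674}
```

Under that assumption, a.java would hold 280 distinct terms and 2674 tokens in this field, which matches the picture of one document being counted twice.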

I tried refreshing the index, but the duplicated counts remained in the field statistics. If I force a re-index of all 46 documents, however, field_statistics returns to the correct numbers:
"field_statistics": {
  "sum_doc_freq": 8715,
  "doc_count": 46,
  "sum_ttf": 27672
}

I am using Elasticsearch 5.6.1 with the ingest-attachment 5.6.1 plugin; I am not sure whether this issue is related to the plugin.

Thanks in advance for the help.

mayya · April 27, 2018, 8:11pm

This is expected behaviour. When you modify a document, Elasticsearch actually deletes the previous document and creates a new one in its place. For some time the deleted document is still present in its segment, only marked as deleted, and it is still counted towards field_statistics (searches are smart enough to exclude it). A refresh makes the new document visible to search but does not remove the old one. When segments are merged into a new segment, deleted documents are wiped out completely; the newly created Lucene segment no longer contains them, and this is when you will get the correct statistics.
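To make the mechanism concrete, here is a minimal toy model in Python. It is not Lucene or Elasticsearch code: `ToySegment`, `field_statistics`, `update` and `merge` are hypothetical names invented for illustration, and the only behaviour modelled is the one described above (statistics count every document physically present in a segment, including ones marked deleted, until a merge drops them).

```python
class ToySegment:
    """Hypothetical stand-in for a segment: stored docs plus deleted ids."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> list of tokens
        self.deleted = set()     # ids marked deleted but still stored


def field_statistics(segments):
    """Count every stored doc, deleted or not (the behaviour in question)."""
    stats = {"sum_doc_freq": 0, "doc_count": 0, "sum_ttf": 0}
    for seg in segments:
        for tokens in seg.docs.values():
            stats["doc_count"] += 1
            stats["sum_doc_freq"] += len(set(tokens))  # distinct terms
            stats["sum_ttf"] += len(tokens)            # all occurrences
    return stats


def update(segments, doc_id, tokens):
    """An update marks the old copy deleted and writes a new segment."""
    for seg in segments:
        if doc_id in seg.docs:
            seg.deleted.add(doc_id)
    segments.append(ToySegment({doc_id: tokens}))


def merge(segments):
    """A merge copies only live docs into one new segment."""
    live = {}
    for seg in segments:
        for doc_id, tokens in seg.docs.items():
            if doc_id not in seg.deleted:
                live[doc_id] = tokens
    return [ToySegment(live)]


segments = [ToySegment({"a.java": ["class", "a", "class"],
                        "b.java": ["class", "b"]})]
stats_initial = field_statistics(segments)

update(segments, "a.java", ["class", "a", "class"])  # same terms, re-indexed
stats_after_update = field_statistics(segments)

segments = merge(segments)
stats_after_merge = field_statistics(segments)

print(stats_initial)
print(stats_after_update)
print(stats_after_merge)
```

In this toy run the statistics grow after the update (doc_count goes from 2 to 3) and return to the initial values after the merge, mirroring the 46 → 47 → 46 pattern in the question.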

You can force a merge of Lucene segments in rare cases (doing this often is not recommended).
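As a rough sketch of what forcing a merge could look like from Python, the snippet below builds (but does not send) a request to the `_forcemerge` endpoint with the `only_expunge_deletes` flag. The host `http://localhost:9200`, the index name `my_index`, and the helper `build_forcemerge_request` are assumptions made for illustration; adjust them to your cluster.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_forcemerge_request(base_url, index, only_expunge_deletes=True):
    """Build a POST request for <base_url>/<index>/_forcemerge (hypothetical helper)."""
    query = urlencode({"only_expunge_deletes": str(only_expunge_deletes).lower()})
    url = f"{base_url.rstrip('/')}/{index}/_forcemerge?{query}"
    return Request(url, method="POST")


# Host and index name are placeholders, not values from this thread.
req = build_forcemerge_request("http://localhost:9200", "my_index")
print(req.get_method(), req.full_url)
# POST http://localhost:9200/my_index/_forcemerge?only_expunge_deletes=true

# To actually send it (requires a reachable cluster):
# from urllib.request import urlopen
# urlopen(req)
```

Note that `only_expunge_deletes` may leave segments with few deletions untouched, so the statistics are not guaranteed to match exactly afterwards. A force merge is also an expensive operation, which is why it is best kept for rare cases.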

