Inconsistent sum_doc_freq and sum_ttf numbers in _mtermvectors

Hi there,

We have been using Elasticsearch and ingest-attachment to index program files for full-text search and keyword extraction (filtered by TFxIDF), using metadata from _mtermvectors. However, we recently found an inconsistency in the _mtermvectors API.

Suppose we index 46 program documents and then call _mtermvectors. We get field_statistics like this:
"field_statistics": {
  "sum_doc_freq": 8715,
  "doc_count": 46,
  "sum_ttf": 27672
}

Then I modified a single document (call it a.java) by adding just one space character, indexed the updated document, and called _mtermvectors again. I got these updated field_statistics:
"field_statistics": {
  "sum_doc_freq": 8995,
  "doc_count": 47,
  "sum_ttf": 30346
}
From the result above, doc_count has become 47 (we only have 46 documents), and sum_doc_freq and sum_ttf include duplicated term counts. It seems that the terms in a.java were somehow cloned, even though only one space was added.
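
As a quick sanity check, the difference between the two responses can be computed directly. This is only a sketch in Python using the numbers quoted above; the helper name `stats_delta` is made up for illustration. If the whole increase comes from one extra copy of a.java (an assumption, not something the API reports), the deltas would be that file's number of distinct terms and total token count in this field.

```python
# Field statistics copied from the two _mtermvectors responses above.
before = {"sum_doc_freq": 8715, "doc_count": 46, "sum_ttf": 27672}
after = {"sum_doc_freq": 8995, "doc_count": 47, "sum_ttf": 30346}


def stats_delta(old, new):
    """Return the per-key difference new - old (hypothetical helper)."""
    return {key: new[key] - old[key] for key in old}


delta = stats_delta(before, after)
print(delta)
# {'sum_doc_freq': 280, 'doc_count': 1, 'sum_ttf': 2674}
```

So the statistics grew by exactly one document, 280 document-frequency entries and 2674 term occurrences.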

I tried refreshing my index, but I still see the cloned terms in the field statistics. However, if I force a reindex of all 46 documents in my index, the field_statistics return to the correct numbers:
"field_statistics": {
  "sum_doc_freq": 8715,
  "doc_count": 46,
  "sum_ttf": 27672
}

I am using Elasticsearch 5.6.1 with the ingest-attachment 5.6.1 plugin. I am not sure whether this issue is related to the plugin.

Thanks in advance for your help.

This is expected behaviour. When you modify a document, the previous version is actually marked as deleted and a new document is created in its place. For some time the deleted document stays in its segment and still counts towards field_statistics (searches are smart enough to exclude it once a refresh has made the change visible, which is why a refresh does not change these numbers). When segments are later merged into a new segment, deleted documents are wiped out completely, so the newly created Lucene segment no longer contains them, and that is when you get your correct statistics again.
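
The mechanism can be illustrated with a toy model in Python. This is not Lucene or Elasticsearch code: `ToySegment`, `update`, `merge` and `field_statistics` are hypothetical names, and the model only assumes what is described above, namely that an update marks the old copy as deleted and writes a new one, that statistics are summed over everything still stored in the segments, and that a merge drops the deleted copies.

```python
class ToySegment:
    """Hypothetical stand-in for a segment: stored docs plus a set of deleted IDs."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> list of tokens
        self.deleted = set()     # IDs marked as deleted but still stored


def field_statistics(segments):
    """Sum statistics over all stored docs, deliberately including deleted ones."""
    doc_count = sum_doc_freq = sum_ttf = 0
    for seg in segments:
        for tokens in seg.docs.values():
            doc_count += 1
            sum_doc_freq += len(set(tokens))  # distinct terms in this doc
            sum_ttf += len(tokens)            # all term occurrences
    return {"sum_doc_freq": sum_doc_freq, "doc_count": doc_count, "sum_ttf": sum_ttf}


def update(segments, doc_id, tokens):
    """Mark every old copy as deleted, then write the new copy to a new segment."""
    for seg in segments:
        if doc_id in seg.docs:
            seg.deleted.add(doc_id)
    segments.append(ToySegment({doc_id: tokens}))


def merge(segments):
    """Copy only live docs into one new segment; deleted copies disappear."""
    live = {}
    for seg in segments:
        for doc_id, tokens in seg.docs.items():
            if doc_id not in seg.deleted:
                live[doc_id] = tokens
    return [ToySegment(live)]


segments = [ToySegment({"a.java": ["class", "a", "class"], "b.java": ["class", "b"]})]
initial = field_statistics(segments)

update(segments, "a.java", ["class", "a", "class"])  # same tokens, e.g. one extra space
after_update = field_statistics(segments)

segments = merge(segments)
after_merge = field_statistics(segments)

print(initial)       # {'sum_doc_freq': 4, 'doc_count': 2, 'sum_ttf': 5}
print(after_update)  # {'sum_doc_freq': 6, 'doc_count': 3, 'sum_ttf': 8}
print(after_merge)   # {'sum_doc_freq': 4, 'doc_count': 2, 'sum_ttf': 5}
```

The pattern matches the numbers in the question: the counts go up by one document's worth after the update and return to the original values once the stale copy is merged away.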

You can force a merge of the Lucene segments with the _forcemerge API in rare cases, but doing this often is not recommended.
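
For reference, a force merge is a POST to the index's _forcemerge endpoint. The sketch below uses only the Python standard library; the address `http://localhost:9200`, the index name `programs` and both function names are assumptions for illustration. Merging down to a single segment (`max_num_segments=1`) should drop the documents marked as deleted, but it can be expensive on a large index.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def force_merge_url(base_url, index, max_num_segments=1):
    """Build the _forcemerge URL for one index (hypothetical helper)."""
    query = urlencode({"max_num_segments": max_num_segments})
    return f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"


def force_merge(base_url, index, max_num_segments=1):
    """Send the force merge request and return the raw response body."""
    request = Request(force_merge_url(base_url, index, max_num_segments), method="POST")
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Assumed host and index name; adjust to your cluster before calling force_merge().
print(force_merge_url("http://localhost:9200", "programs"))
# http://localhost:9200/programs/_forcemerge?max_num_segments=1
```

Calling `force_merge("http://localhost:9200", "programs")` and then repeating the _mtermvectors request would be one way to check whether the statistics have returned to the expected values.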
