Hi there,

We have been using Elasticsearch with the ingest-attachment plugin to index program files for full-text search and keyword extraction (filtered by TFxIDF), using metadata from _mtermvectors. However, we recently found an inconsistency in the results of the _mtermvectors API.

Suppose we have indexed 46 program documents. Calling _mtermvectors then returns field_statistics like this:

"field_statistics": {
    "sum_doc_freq": 8715,
    "doc_count": 46,
    "sum_ttf": 27672
}
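
For reference, the request we send is roughly of the following shape. This is only a minimal sketch: the index name `programs`, the type `doc`, the field `attachment.content`, and the document ids are placeholders rather than our real setup, and the helper name `build_mtermvectors_request` is made up for illustration.

```python
import json


def build_mtermvectors_request(index, doc_type, ids, fields):
    """Build the path and JSON body for a multi term vectors request.

    All argument values used below are placeholders (assumptions),
    not the real index layout.
    """
    path = "/{}/{}/_mtermvectors".format(index, doc_type)
    body = {
        "ids": list(ids),
        "parameters": {
            "fields": list(fields),
            # Ask for per-field statistics (sum_doc_freq, doc_count, sum_ttf).
            "field_statistics": True,
            # Ask for per-term statistics (doc_freq, ttf).
            "term_statistics": True,
        },
    }
    return path, body


path, body = build_mtermvectors_request(
    "programs", "doc", ["1", "2"], ["attachment.content"]
)
print("POST", path)
print(json.dumps(body, indent=2))
```

The printed body can be sent with any HTTP client; nothing here talks to a cluster.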

Then I modified just one document, call it a.java, by adding a single space character. I re-indexed that document and called _mtermvectors again. This time I got different field_statistics:

"field_statistics": {
    "sum_doc_freq": 8995,
    "doc_count": 47,
    "sum_ttf": 30346
}

From the result above, doc_count has become 47 even though we still have only 46 documents. sum_doc_freq and sum_ttf have also grown, as if the terms of a.java were counted twice, although the only change was one added space.
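
To make the change concrete, the difference between the two field_statistics responses can be computed like this. It is a small sketch using only the numbers quoted above; the helper name `diff_field_statistics` is made up for illustration.

```python
def diff_field_statistics(before, after):
    """Return after - before for every key in the two field_statistics dicts."""
    return {key: after[key] - before[key] for key in before}


before = {"sum_doc_freq": 8715, "doc_count": 46, "sum_ttf": 27672}
after = {"sum_doc_freq": 8995, "doc_count": 47, "sum_ttf": 30346}

delta = diff_field_statistics(before, after)
print(delta)
# {'sum_doc_freq': 280, 'doc_count': 1, 'sum_ttf': 2674}
```

So one extra document, 280 extra entries in sum_doc_freq, and 2674 extra term occurrences are being counted. If a.java on its own accounts for roughly those numbers (we have not verified this), that would match the impression that its old version is still included in the statistics.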

I tried refreshing the index, but the duplicated counts are still there. However, if I force a re-index of all 46 documents, field_statistics goes back to the correct numbers:

"field_statistics": {
    "sum_doc_freq": 8715,
    "doc_count": 46,
    "sum_ttf": 27672
}

I am using Elasticsearch 5.6.1 with the ingest-attachment 5.6.1 plugin. I am not sure whether this issue is related to the plugin.

Thanks in advance for any help.