Hi there,
We have been using Elasticsearch with the ingest-attachment plugin to index program files for full-text search and keyword extraction (filtered by TF-IDF) using metadata from _mtermvectors. However, we recently found an inconsistency in the _mtermvectors API.
Suppose we index 46 program documents and then call _mtermvectors. We get field_statistics like this:
"field_statistics": {
"sum_doc_freq": 8715,
"doc_count": 46,
"sum_ttf": 27672
}
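For reference, here is a minimal Python sketch of the kind of request body that produces these statistics. The helper name build_mtermvectors_body and the field name attachment.content (the usual ingest-attachment target field) are assumptions for illustration, not our exact code; the body would be sent with POST to /<index>/<type>/_mtermvectors.

```python
import json


def build_mtermvectors_body(doc_ids, field="attachment.content"):
    """Build a request body for the _mtermvectors endpoint.

    The field name is an assumption; replace it with the field
    that actually holds the extracted text in your mapping.
    """
    return {
        "ids": list(doc_ids),
        "parameters": {
            "fields": [field],
            # Per-term statistics (doc_freq, ttf) used for TF-IDF.
            "term_statistics": True,
            # Per-field statistics (doc_count, sum_doc_freq, sum_ttf).
            "field_statistics": True,
        },
    }


body = build_mtermvectors_body(["a.java"])
print(json.dumps(body, indent=2))
```

The document id "a.java" is only an example; in practice the list would hold whatever ids the documents were indexed under.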
Then I modified only one document, say a.java, by adding a single space character, re-indexed the updated document, and called _mtermvectors again. I got updated field_statistics:
"field_statistics": {
"sum_doc_freq": 8995,
"doc_count": 47,
"sum_ttf": 30346
}
From the above result, doc_count became 47 (we only have 46 documents), and sum_doc_freq and sum_ttf also increased. It seems that the terms in a.java were counted twice, even though only one space was added.
I tried refreshing the index, but I still see the duplicated term counts in the field statistics. However, if I force a re-index of all 46 documents, field_statistics returns to the correct numbers:
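To make the size of the discrepancy explicit, the difference between the two field_statistics snapshots above can be computed with a short Python sketch. The function name stats_delta is just for illustration.

```python
# field_statistics before and after re-indexing a.java, copied from above.
before = {"sum_doc_freq": 8715, "doc_count": 46, "sum_ttf": 27672}
after = {"sum_doc_freq": 8995, "doc_count": 47, "sum_ttf": 30346}


def stats_delta(old, new):
    """Return new minus old for every statistic present in old."""
    return {key: new[key] - old[key] for key in old}


delta = stats_delta(before, after)
print(delta)
# {'sum_doc_freq': 280, 'doc_count': 1, 'sum_ttf': 2674}
```

A delta of one document, 280 in sum_doc_freq, and 2674 in sum_ttf is what a second copy of a.java would contribute if that file has 280 distinct terms and 2674 tokens in the field. I have not checked those per-file counts separately, so that reading is a guess.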
"field_statistics": {
"sum_doc_freq": 8715,
"doc_count": 46,
"sum_ttf": 27672
}
I am using Elasticsearch 5.6.1 with the ingest-attachment 5.6.1 plugin. I am not sure whether this issue is related to the plugin.
Thanks in advance for any help.