We've recently hit a problem where a terms aggregation does not include all matching documents, even though those documents match all of the aggregation's filters. This is intermittent (roughly 10% of runs are affected). Force merging the affected index down to a single segment resolves the issue, but that shouldn't be necessary.
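For reference, the force-merge workaround is essentially the following (the index name is a placeholder):

```
POST /my-index/_forcemerge?max_num_segments=1
```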
To give a brief overview:

1. Spark writes 3 documents to Elasticsearch using the saveToEs method.
2. The index is refreshed.
3. A terms aggregation is run.
4. The results are incorrect.
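For context, the aggregation request is along these lines (the index and field names here are stand-ins; the real settings and query are in the attached gists):

```
GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": { "field": "some_keyword_field" }
    }
  }
}
```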
I've looked into the segments for the index when it fails to return the correct results: it has 2 segments, one of which appears to contain the document that is missing from the aggregation.
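The segment state was inspected with the segments API (index name is a placeholder):

```
GET /my-index/_segments
```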
Elasticsearch 6.1.0.
I've attached gists for the settings etc. In this case it was the document with id 5a5ad8be80f4913e3a7f564fb3dc20b3ab855382 that is not being returned in the aggregation results. The source is, however, returned if size is set to a non-zero value.
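That is, running the same request with a non-zero size returns the document in the hits, while it is still absent from the aggregation buckets (again, index and field names below are placeholders):

```
GET /my-index/_search
{
  "size": 10,
  "aggs": {
    "my_terms": {
      "terms": { "field": "some_keyword_field" }
    }
  }
}
```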