We have an index with a lot of auto-generated data for load testing, and we noticed a significant difference between the Elasticsearch doc count (from the _count API) and the count reported by the /indices API, e.g.:
ES documents : ~80 million
Lucene documents : ~2 billion
There is only a single nested field in the document, so can someone explain what might be causing such a huge inflation in the Lucene document count?
Also, if I run a terms aggregation on this index, does this affect the overall number of buckets that get created in memory?
How many objects does your nested field hold on average? Each nested object is stored as a separate Lucene document behind the scenes and does show up in the indices stats.
It might help if you show the exact output from the APIs. Do you have a lot of updated and/or deleted documents that show up in the indices API but not in _count?
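As a rough sanity check on the numbers quoted in the question, the arithmetic can be sketched as follows. This assumes every Lucene document is either a top-level document or one of its nested objects and ignores updated or deleted documents; the function name is hypothetical, not part of any Elasticsearch client.

```python
def avg_nested_per_doc(es_docs: int, lucene_docs: int) -> float:
    """Estimate the average number of nested objects per top-level document.

    Assumes lucene_docs = es_docs * (1 + avg_nested), i.e. one Lucene
    document per top-level document plus one per nested object.
    """
    if es_docs <= 0:
        raise ValueError("es_docs must be positive")
    return lucene_docs / es_docs - 1


if __name__ == "__main__":
    # Approximate figures from the question: ~80 million vs ~2 billion.
    print(avg_nested_per_doc(80_000_000, 2_000_000_000))
```

With the approximate figures above, this works out to about 24 nested objects per document, so a nested array of a few dozen entries would be enough to explain the gap.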
Thanks for getting back, @Christian_Dahlqvist. Do these nested documents affect terms aggregations even when I am not explicitly querying the nested fields?
It may be worthwhile creating a runtime field containing the size of the nested array, or maybe even indexing it as a proper field using an ingest pipeline. That way you would be able to aggregate on it, get statistics, and see if it all adds up.
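A minimal sketch of the ingest pipeline variant, assuming the nested field is called `my_nested` and the new counter field is called `nested_count` (both names are hypothetical; substitute your own). The Python below only builds the pipeline body, which you would then send with `PUT _ingest/pipeline/<name>` and apply when indexing or reindexing.

```python
import json


def nested_count_pipeline(nested_field: str, target_field: str) -> dict:
    """Build an ingest pipeline body that stores the nested array size.

    Both field names are supplied by the caller; nothing here is specific
    to the index discussed in this thread.
    """
    # Painless script: 0 if the field is missing, the list size if it is
    # an array, and 1 if it is a single object.
    source = (
        "def v = ctx[params.nested_field]; "
        "ctx[params.target_field] = "
        "v == null ? 0 : (v instanceof List ? v.size() : 1);"
    )
    return {
        "description": "Store the number of nested objects per document",
        "processors": [
            {
                "script": {
                    "lang": "painless",
                    "source": source,
                    "params": {
                        "nested_field": nested_field,
                        "target_field": target_field,
                    },
                }
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    print(json.dumps(nested_count_pipeline("my_nested", "nested_count"), indent=2))
```

Once the counter is indexed, a `sum` or `stats` aggregation on it should come out close to the Lucene document count minus the Elasticsearch document count, if nested objects are indeed the cause.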