Lucene vs Elasticsearch document count difference and its impact on terms aggregation buckets

We have an index with a lot of auto-generated data for load testing, and we noticed a significant difference between the Elasticsearch document count (from the _count API) and the document count reported by the indices API, e.g.:
ES documents : ~80 million
Lucene documents : ~2 billion
There is only a single nested field in the document, so can someone explain what might be causing such a huge inflation in the Lucene document count?

Also, if I run a terms aggregation on this index, does this affect the overall number of buckets that get created in memory?

How many objects does your nested field hold on average? Each nested object is stored as a separate Lucene document behind the scenes and does show up in the indices stats.

On average 1 or 2, hence I am not sure what might be the cause.

@Christian_Dahlqvist, any further insight into this?

It might help if you show the exact output from the APIs. Do you have a lot of updated and/or deleted documents that show up in the indices api but not in count?

In this case there are no updated or deleted documents.
Indices API output :
green, open, index name, uuid, 5, 1, 2450791167, 0, 239.6gb, 119.8gb

Count API output :

"count" : 88336911,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0

My guess is that you have more nested documents than you think you do.
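As a rough sanity check, the two numbers you posted can be turned into an average nested-object count per document. This is only a sketch: it assumes that every Lucene document which is not a top-level document is a nested object, that there is a single level of nesting, and that there are no deleted documents (which matches the 0 in your indices output).

```python
# Figures copied from the API outputs quoted earlier in this thread.
lucene_docs = 2_450_791_167  # document count from the indices API
es_docs = 88_336_911         # "count" from the _count API

# Assumption: every Lucene document that is not a top-level document
# is one nested object (single nested field, no deleted documents).
nested_docs = lucene_docs - es_docs
avg_nested_per_doc = nested_docs / es_docs

print(f"nested documents: {nested_docs:,}")
print(f"average nested objects per document: {avg_nested_per_doc:.1f}")
```

Under those assumptions the average works out to roughly 27 nested objects per document, which is a long way from 1 or 2, so it is worth checking how the load-test generator populates the nested field.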

Thanks for getting back, @Christian_Dahlqvist. Do these nested documents affect terms aggregations even when I am not explicitly querying the nested fields?

No, not that I know of.

It may be worthwhile creating a runtime field containing the size of the nested array, or maybe even indexing it as a proper field using an ingest pipeline. That way you would be able to aggregate on it, get statistics, and see if it all adds up.
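A minimal sketch of the runtime-field idea, written as Python that only builds and prints the search request body. The nested field name `items` and the runtime field name `nested_count` are hypothetical placeholders, and the Painless script assumes the nested objects are present in `_source`.

```python
import json

# Hypothetical names; replace with the real nested field in your mapping.
NESTED_FIELD = "items"
RUNTIME_FIELD = "nested_count"

# Painless script: emit the number of nested objects found in _source.
# A missing field counts as 0, a single object (not an array) as 1.
painless = (
    f"def v = params._source['{NESTED_FIELD}'];"
    " if (v == null) { emit(0); }"
    " else if (v instanceof List) { emit(v.size()); }"
    " else { emit(1); }"
)

search_body = {
    "size": 0,  # only the aggregation result is needed, no hits
    "runtime_mappings": {
        RUNTIME_FIELD: {"type": "long", "script": {"source": painless}}
    },
    "aggs": {
        "nested_stats": {"stats": {"field": RUNTIME_FIELD}}
    },
}

# Print the JSON body so it can be pasted into a search request.
print(json.dumps(search_body, indent=2))
```

The printed body could be sent as the body of a `_search` request against the index. If the assumptions hold, the `sum` from the stats aggregation plus the `_count` result should land close to the document count reported by the indices API. Note that a runtime field reads `_source` for every document at query time, so this may be slow on an index of this size.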

Thanks for the update.
