Lucene vs Elastic Search Document Count difference and its impact on term aggregation buckets

S_Star · July 19, 2023, 8:57pm

We have an index with a lot of auto-generated data for load testing, and we noticed a significant difference between the Elasticsearch doc count (from the _count API) and the count reported by the /indices API, e.g.:
ES documents : ~80 million
Lucene documents : ~2 billion
There is only a single nested field in the document, so can someone explain what might be causing such a huge inflation in the Lucene document count?

Also, if I run a terms aggregation on this index, does this affect the overall number of buckets that get created in memory?

Christian_Dahlqvist · July 20, 2023, 4:32am

How many objects does your nested field hold on average? Each nested object is stored as a separate Lucene document behind the scenes, and these documents do show up in the indices stats.

S_Star · July 20, 2023, 10:59am

On average 1 or 2, hence I am not sure what might be the cause.

S_Star · July 21, 2023, 11:37am

@Christian_Dahlqvist Any further insight into this?

Christian_Dahlqvist · July 21, 2023, 1:00pm

It might help if you show the exact output from the APIs. Do you have a lot of updated and/or deleted documents that show up in the indices api but not in count?

S_Star · July 21, 2023, 1:40pm

In this case there are no updated and/or deleted documents.
Indices API output:
green, open, index name, uuid, 5, 1, 2450791167, 0, 239.6gb, 119.8gb

Count API output:

{
"count" : 88336911,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
}
}

Christian_Dahlqvist · July 21, 2023, 5:58pm

My guess is that you have more nested documents than you think you do.
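
As a rough sanity check, the two counts posted above can be turned into an implied average number of nested objects per document. This is only a sketch under stated assumptions: that the large number in the indices output is the Lucene document count for the primary shards, that it includes one hidden Lucene document per nested object plus one per top-level document, and that there is a single level of nesting. The variable names are made up for illustration.

```python
# Figures copied from the API outputs posted earlier in this thread.
lucene_docs = 2_450_791_167   # docs count from the indices API output
es_docs = 88_336_911          # "count" from the _count API

# Assumption: each top-level document contributes 1 Lucene document,
# plus 1 more for every nested object it holds (single nesting level).
lucene_per_es_doc = lucene_docs / es_docs
implied_nested_per_doc = (lucene_docs - es_docs) / es_docs

print(f"Lucene docs per ES doc: {lucene_per_es_doc:.1f}")
print(f"Implied nested objects per doc: {implied_nested_per_doc:.1f}")
```

If those assumptions hold, the numbers imply somewhere around 27 nested objects per document on average, which is far more than 1 or 2.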

S_Star · July 21, 2023, 8:50pm

Thanks for getting back, @Christian_Dahlqvist. Do these nested documents affect terms aggregations even when I am not explicitly querying the nested fields?

Christian_Dahlqvist · July 22, 2023, 5:36am

No, not that I know of.

It may be worthwhile creating a runtime field containing the size of the nested array, or maybe even indexing it as a proper field using an ingest pipeline. That way you would be able to aggregate on it, get statistics, and see if it all adds up.
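
A minimal sketch of what such a runtime-field request could look like, built here as a plain Python dict so it can be sent with any client. The nested field name `items` and the runtime field name `nested_size` are hypothetical placeholders, and the Painless script is an untested assumption about how the nested array appears in `_source`; adjust both to the real mapping.

```python
import json


def build_nested_size_query(nested_field: str) -> dict:
    """Build a search body that computes stats on the nested array size.

    `nested_field` is a hypothetical name; use the real nested field.
    """
    # Painless script (assumption): read the raw array from _source and
    # emit its length; emit 0 if absent, 1 if a single object was indexed.
    script = (
        "def v = params._source['" + nested_field + "']; "
        "if (v == null) { emit(0); } "
        "else if (v instanceof List) { emit(v.size()); } "
        "else { emit(1); }"
    )
    return {
        "size": 0,  # only the aggregation result is needed, no hits
        "runtime_mappings": {
            "nested_size": {"type": "long", "script": {"source": script}}
        },
        "aggs": {"nested_size_stats": {"stats": {"field": "nested_size"}}},
    }


body = build_nested_size_query("items")
print(json.dumps(body, indent=2))
```

The printed body would be sent to the index's `_search` endpoint. If there is only one level of nesting, the `sum` from the stats result plus the `_count` value should come close to the document count shown by the indices API. Reading `_source` for every document is slow on an index of this size, so running it against a filtered subset first may be more practical.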

S_Star · July 23, 2023, 8:10pm

Thanks for the update.

