Count API performance degraded after cluster upgrade from 6.8.4 to 7.5

Hi,

After upgrading our cluster from 6.8.4 to 7.5.0, the count API is much slower in 7.5 than it was in 6.8.4.

Running this in Kibana:

GET /myindex or alias/_count

In 6.8.4, every count API response came back in under a second, with or without an indexing process running.
In 7.5, the response time is over 1 second, and sometimes over 20 seconds when an indexing process is running on the index.

Any suggestions?

Thank you !

I'm seeing the same issue! Seems like the fastest way now might be to use _cat/indices and add up the values yourself. It might not be as accurate, though, and it is still slower than older versions of Elasticsearch.

Our old way with _count

time curl -H "Content-Type: application/json" -s --insecure https://localhost:9200/_count
{"count":182116470792,"_shards":{"total":20107,"successful":20107,"skipped":0,"failed":0}}

real	0m6.795s

_cat/count is the same

time curl -H "Content-Type: application/json" -s --insecure 'https://localhost:9200/_cat/count?format=json'
[{"epoch":"1575661460","timestamp":"19:44:20","count":"182117127449"}]

real	0m6.523s

vs /_cat/indices is faster

time curl -H "Content-Type: application/json" -s --insecure 'https://localhost:9200/_cat/indices?h=index,docs.count&format=json' > /dev/null

real	0m2.351s
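
The "add up the values yourself" step can be folded into the same pipeline. A minimal sketch, assuming the plain-text _cat/indices output restricted to the docs.count column (one value per line); the helper name sum_doc_counts is made up for illustration, and blank lines are skipped:

```shell
# Sum a column of per-index document counts read from stdin.
# Hypothetical helper; not part of Elasticsearch or curl.
sum_doc_counts() {
  # NF skips empty lines; %.0f avoids scientific notation on large totals.
  awk 'NF { total += $1 } END { printf "%.0f\n", total }'
}

# Example against a live cluster (same URL as the commands above):
# curl -s --insecure 'https://localhost:9200/_cat/indices?h=docs.count' | sum_doc_counts
```

The total is held as a double inside awk, so it stays exact for counts in the hundreds of billions seen here.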

@Andy_Wick Thanks for the follow-up. The interesting point is that the count API in 6.x and earlier was much faster (about the same as using _cat/indices). I have not seen any release notes mentioning count API changes in 7.x.

Any help is highly appreciated.

Opened https://github.com/elastic/elasticsearch/issues/50198 for this

I wonder if this is a side effect of the notion of "search idle" added in 7.x. In 6.x and before, we refresh automatically in the background every second by default, so requests always use the available searcher, while in 7.x a request made on a "search idle" shard is parked until the refresh is done. Would you be able to provide us with the output of the [hot_threads](https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html) API while the slow query is running? This behavior should only affect the first request that hits a search-idle shard configured with the default refresh_interval. You can also opt out of this behavior in 7.x by setting an explicit index.refresh_interval.
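
One way to rule search idle in or out is to set index.refresh_interval explicitly through the update index settings API. A sketch only, assuming the cluster URL used elsewhere in this thread; ES_URL, refresh_settings_body and set_refresh_interval are illustrative names, and the index name and interval are placeholders:

```shell
# Base URL is an assumption; adjust to your cluster.
ES_URL="${ES_URL:-https://localhost:9200}"

# Build the JSON body for an explicit refresh interval, e.g. "1s".
refresh_settings_body() {
  printf '{"index":{"refresh_interval":"%s"}}\n' "$1"
}

# Apply the setting to an index or pattern.
# usage: set_refresh_interval INDEX INTERVAL
set_refresh_interval() {
  curl -s --insecure -X PUT -H "Content-Type: application/json" \
    "$ES_URL/$1/_settings" -d "$(refresh_settings_body "$2")"
}

# Example (not executed here):
# set_refresh_interval myindex 1s
```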

Seems like the fastest way now might be to use _cat/indices and add up the values yourself.

This API does not use _search to retrieve the docs.count so it is expected to be faster. It uses the index statistics that are exposed per reader instance and more importantly does not check if a refresh is needed or not.
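
The index stats API exposes a docs.count too, and as far as I know it is the same statistic that _cat/indices reports. A sketch under that assumption, using the filter_path response filter to trim the reply to one field; extract_count is a hypothetical helper that pulls the number out with sed rather than a real JSON parser:

```shell
# Print the numeric value of the "count" field found on each input line
# (the last one, if a line contains several).
extract_count() {
  sed -n 's/.*"count" *: *\([0-9][0-9]*\).*/\1/p'
}

# Example (not executed here): total primary document count across all indices.
# curl -s --insecure \
#   'https://localhost:9200/_stats/docs?filter_path=_all.primaries.docs.count' | extract_count
```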

I should have mentioned that these are time-based indices. So I switched to indices that are NOT being written to, and I see the same issue, even after manually calling refresh on them. (sessions2-190* matches Jan-Sept.)

time curl -H "Content-Type: application/json" -s --insecure 'https://localhost:9200/sessions2-18*,sessions2-190*/_refresh'
{"_shards":{"total":38075,"successful":38068,"failed":0}}
real	0m5.950s

time curl -H "Content-Type: application/json" -s --insecure 'https://localhost:9200/sessions2-18*,sessions2-190*/_count'
{"count":148953359394,"_shards":{"total":19034,"successful":19034,"skipped":0,"failed":0}}
real	0m5.704s

time curl -H "Content-Type: application/json" -s --insecure 'https://localhost:9200/sessions2-18*/_count'
{"count":34383716799,"_shards":{"total":7226,"successful":7226,"skipped":0,"failed":0}}
real	0m0.806s

time curl -H "Content-Type: application/json" -s --insecure 'https://localhost:9200/sessions2-190*/_count'
{"count":114569642595,"_shards":{"total":11808,"successful":11808,"skipped":0,"failed":0}}
real	0m2.669s

What this just pointed out is that _count takes about the same time as _refresh. _count isn't calling refresh when it doesn't need to, is it? These indices haven't been written to for months.

Oh, and they have, and have always had, "refresh_interval":"60s".

So my theory is wrong and something else is taking time. Can you try to get the output of the hot_threads API while the _count query is running?
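
Capturing hot threads while the request is still in flight can be scripted. A rough sketch, assuming the cluster URL from the earlier commands; ES_URL, sample_while_running and hot_threads are illustrative names, and the one-second interval is arbitrary:

```shell
# Base URL is an assumption; adjust to your cluster.
ES_URL="${ES_URL:-https://localhost:9200}"

# Run a sampler command once per interval for as long as PID is alive.
# usage: sample_while_running PID INTERVAL_SECONDS COMMAND [ARGS...]
sample_while_running() {
  pid=$1
  interval=$2
  shift 2
  while kill -0 "$pid" 2>/dev/null; do
    "$@"
    sleep "$interval"
  done
}

# Fetch one hot threads snapshot from all nodes.
hot_threads() {
  curl -s --insecure "$ES_URL/_nodes/hot_threads"
}

# Example (not executed here): start the slow request in the background,
# then sample hot threads until it returns.
# curl -s --insecure "$ES_URL/_count" > /dev/null &
# sample_while_running "$!" 1 hot_threads > hot_threads.txt
```

Run the sampler in the same shell that started the background request, so the shell can notice when that request has exited.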

Sure, I can send the hot threads output privately. Where should I send it?

What about my question: sessions2-18* alone takes 0.8 seconds and sessions2-190* alone takes 2.6 seconds, but both together take 5.7 seconds? That really looks like the number of shards/indices is causing an issue or a bug.

That's a lot of shards, and the performance will greatly depend on the number of nodes that you have in your cluster. Do you really need that number of shards? I am not aware of any change in 7.x that would affect index patterns, but this looks like a big number considering the total number of documents involved. What is your policy for creating new shards?

Yes, it is a decent-sized cluster: 69 nodes, over a PB of data. We used to create 4 indices a day, now create 1, and are slowly shrinking down the old indices. Whenever I see nonlinear growth in execution time (0.8 + 2.6 means both together should take ~3.4s, not ~5.7s), I usually suspect some kind of queueing issue or a loop inside of a loop that shouldn't be there.
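
The nonlinearity can be put as a single number. A small sketch using the timings posted earlier in this thread; combined_ratio is a made-up helper name, and a ratio near 1.0 would mean the combined request costs about the same as the two separate ones added together:

```shell
# Ratio of the combined request time to the sum of the separate ones.
# usage: combined_ratio TIME_A TIME_B TIME_COMBINED   (all in seconds)
combined_ratio() {
  awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN { printf "%.2f\n", c / (a + b) }'
}

# sessions2-18* alone, sessions2-190* alone, both patterns together
combined_ratio 0.806 2.669 5.704
```

With those three measurements the ratio comes out at about 1.64, so the combined request takes roughly two thirds longer than the two separate requests added together.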

This API is slower than it used to be. If the answer is WAD (working as designed), then I'll let it go; I've already switched to the workaround.

That's hard to say, to be honest. You have a slowdown in 7.x that is unexplained at the moment, but I mentioned the number of shards because that's an actionable item that should speed up your queries. I also wonder if the slowdown you're seeing is in line with what @jihua.zhong describes in the initial post.
