Query intermittent performance issue

Hello,

I’m on Elastic Cloud 8.12.

I’m facing an issue where the same query can take from 500ms to more than 30s.

We are working with rather simple data stored in data streams but we also reproduced this behavior with classic Elasticsearch indices.

I have profiled the query and reproduced similar behaviour with very simple queries, such as a term filter or a range filter: { "query": { "bool": { "filter": [ { "range": { "_expirationDate": { "gte": "2024-04-10T14:11:40Z" } } } ] } }, "size": 501 }.
For instance, about 12s was spent on this range query alone against a single index (all of it in the match section of the profiler).
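For reference, this is roughly how I run it from Dev Tools with profiling enabled (the index name is just a placeholder):

GET /my-index/_search
{
  "profile": true,
  "size": 501,
  "query": {
    "bool": {
      "filter": [
        { "range": { "_expirationDate": { "gte": "2024-04-10T14:11:40Z" } } }
      ]
    }
  }
}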

If I run the same query again immediately afterwards, performance is much better thanks to the cache.
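For re-testing, I believe the relevant caches can be cleared between runs with something like this, so the numbers stay comparable (again, the index name is a placeholder):

POST /my-index/_cache/clear?request=true&query=true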

I also investigated the Elasticsearch metrics in Kibana: CPU is always below 30% and JVM heap memory is fine, but there are spikes in read I/O that might explain the issue:

(screenshot of the read I/O metric)
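If raw numbers are more useful than the screenshot, I believe the same disk statistics can also be pulled from the node stats API, something like this (filter_path is only there to trim the response):

GET /_nodes/stats/fs?filter_path=nodes.*.fs.io_stats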

I suspect a configuration issue, maybe with the index shards or a lack of OS cache memory, but I'm fairly new to ES and not sure how to continue my investigation.

Thanks,
Paul

Bonjour Paul :wink:

What is the output of:

GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v

What is the configuration you chose for the cloud instance? What kind of "hardware profile" and how much memory?

I also suspect that if you remove:

"size": 501

from the request, it could be faster. Could you check that?
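That is, running just the filter on its own, something like this (the index name is a placeholder):

GET /your-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "_expirationDate": { "gte": "2024-04-10T14:11:40Z" } } }
      ]
    }
  }
}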

Hello,

Thanks for the quick response. Here is the info you asked for; don't hesitate to ask if you need anything else.

  1. GET /_cat/nodes?v (sorry, I wasn't able to copy it as a table or as a proper image). We can see that ram.percent is very high, but after reading about it I'm not sure it is really an issue.

heap.percent ram.percent cpu load_1m load_5m load_15m node.role master
          49         100   0    1.23    0.91     0.86 rw        -
          86         100   0    1.77    2.24     2.11 rw        -
          34          85   1    1.69    1.40     1.43 cr        -
          45          99   0    1.94    1.86     1.70 mv        -
          20          99  14    4.79    4.20     3.74 himrst    *
          51         100  12    2.51    3.32     3.60 himrst    -

  2. GET /_cat/health?v
status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
green 6 8 254 127 0 0 0 0 - 100%
  3. GET /_cat/indices?v

I have 91 indices, but here is a representative example (the issue is present for queries on both types of index: video metadata and motion events (data stream)).

health status index        pri rep docs.count docs.deleted store.size pri.store.size / dataset.size
green  open   Index-type-1   1   1  193079552            0       60gb 30gb
green  open   Index-type-2   1   1    9313735        60344      2.5gb 1.2gb
  4. General information
  • Hardware profile: General purpose (was using Storage Optimized before and had the same issue)

  • Global memory (it's a test system so we currently don't have much warm/cold)
    (screenshot of the deployment memory configuration)

  • I just tested without the size parameter and the results are perhaps slightly better but overall similar (it's hard to be sure since the query time is very unpredictable, but I still had an index taking 7s to answer the time range query).

Thanks,
Paul