Run about 1k sequential non-scoring queries hourly, 1-2 times daily query times increase up to 5-6x

Hello, I am relatively new to Elasticsearch. I'm getting variable query performance and I'd like to understand what is going on.

I have a public dataset which I'm indexing into a 1-node cluster on Elasticsearch 6.7. There's one index per year, right now all indices are hot, because I want this data searchable. The current year's index is about 31m documents (though nested fields increase the document count 5-6x.)

Every hour, at 15 past, I run a series of about 1,000 non-scoring queries. The query program runs the queries sequentially and there is only one outstanding query at a time.

21-22 hours of the day, the queries finish after 15 minutes. But sometimes, the queries takes much longer. What normally takes 15 minutes starts taking 2-3 times this, and then if the query program kicks off shortly after the last cycle ended, the queries can take 5-6 times the 15 minutes. The way to break the fever is to skip one cycle of these queries, then things return to normal, 15 minutes for these 1,000 queries.

I know what resources my node has, I know its not up to corporate redundancy standards at all! I'd just like to understand what is going on.

The node has 8 GB memory. The minimum and maximum Java heap size in jvm.options is set to 4 GB.

During these distressed query situations, there are no new or different processes running on the node. Several thousand new documents are indexed in a batch every 10 minutes, but this is always going on.

I read about the memory pressure indicator:
So I did a snapshot of _nodes/stats during a time when queries were slow.

"nodes" : {
"node_name" : {
"jvm" : {
"mem" : {
"pools" : {
"old" : {
"used_in_bytes" : 2524861496,
"max_in_bytes" : 4120510464,

If this is the right way to calculate memory pressure, my percentage is around 59-63%, and it doesn't spike during the times when the query times are distressed.

Yes, doubling the memory or adding a second node probably fixes the problem, but that doesn't increase my understanding of Elasticsearch. And most of the time the performance is not just predictable, but also quite acceptable. Basically, I'm looking for some memory or other metric which can predict the current query performance. Any enlightenment would be greatly appreciated.


Are the slow instance randomly distributed or are they always at the same time(s) every day? If you leave it running in a "slow" state you seem to imply that it'll keep being slow until you give it some time off to recover. Are you sure that it needs this recovery time and it's not just either (a) running concurrently with something else that's slowing things down or (b) running concurrently with itself since the last attempt took more than one hour?

My first suspicions would be GC pauses and IO issues. The output of GET _nodes/stats includes information about time spent in GC. Does this vary between the "good" and "bad" runs? To look at I/O I'd try something like iostat or sar and see if there are any differences. I'd particularly look at the amount of data read and the amount of time spent waiting.

Elasticsearch (well, Lucene really) uses the filesystem cache heavily, so if there's something else using a lot of filesystem cache then this can have a bad effect on performance. For instance a daily security scan that reads every file on the disk will completely blow out the cache. Also beware of noisy neighbours if you are running on a VM.

Thanks for the reply David. The distressed situation occurs at different times, now not a daily cronjob. There is also not more than one instance of the query program running.

Indeed my experience is that the queries will be slow until I give it some time off.

I will look into I/O, but in the meantime with GC times:

    "gc" : {
      "collectors" : {
        "young" : {
          "collection_count" : 207437,
          "collection_time_in_millis" : 4320328 [1h 12m]
        "old" : {
          "collection_count" : 532,
          "collection_time_in_millis" : 32560 [5m 25s]

I'm assuming the times is since the Elasticsearch/java process has been running? That number is about 197 hours, so the time spent in young garbage collection is about 0.6%. I have no idea if that's considered a lot.

I'll cook up something today to gather this gc stat every hour, and see if the increase in the time counter is different for a "good" query run and a "bad" query run.

I'm using openjdk 10.0.2. I am self-hosting in AWS on a m5.large (non-variable CPU) instance. The partition mounting /var/lib/elasticsearch is a SSD-based volume, what they call "gp2".


Yes, that's right, these times are cumulative so you have to collect stats twice and calculate deltas.

It doesn't sound like a lot to me.

Ok, the architecture under this is quite opaque, but one does hear about noisy neighbours on AWS from time to time. Let's see what the IO stats are saying.

I think the culprit is I/O. Reading docs, I was reminded that AWS general purpose volumes, even SSD, are not constant performance. They give you some credits for increased I/O operations. If one deplete the extra credits, the performance reverts to baseline. Apparently the baseline performance for my use of Elasticsearch is very poor. (Its possible to monitor this, I just haven't gone there), but the pattern of depleting credits, waiting, then restored performance is consistent with the observed behavior.

The gc time delta for a bad query run and good query run are each around 10 seconds without variation.

Thanks for your help David. I learned something.

For others in the future, while researching garbage collection, I came across these two articles which are very informative:
What are heaps and garbage collection?

Partially explains the output of GET /_nodes/stats


1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.