Timeouts under spiky load

Erik_Rose · July 3, 2012, 8:51pm

How does everybody deal with timeouts under spiky load?

I have a 2-node, 5-shard ES 0.18.7 setup with a 45MB corpus. Each node has 70GB RAM and a 19GB heap size for ES. I'm using the mmapfs store.

Our workload is such that we inundate ES with hundreds of queries over a few seconds. Due to...

the number of pending TCP connections that build up at the ESes and
the 15-second GC pauses on such big heaps,

...a lot of those requests time out. Worse, the large number of pending connections sometimes causes the JVM to become unresponsive, similar to the situation in https://groups.google.com/forum/#!msg/elasticsearch/fxxJG6iSVrM/3fynSV7xPyYJ.

I can hardly be unique here. What does everyone else do? My current avenue of exploration is messing with the search threadpool configuration, making it use a fixed-size queue:

threadpool:
  search:
    type: fixed
    queue_size: 70
    reject_policy: caller

I'm a little fuzzy on what the "caller" reject_policy does, but "abort" would at least return an HTTP 503, which I could catch in my app to trigger a back-off.
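The app-side back-off could look roughly like the following. This is only a minimal sketch, assuming a rejected search really does surface as an HTTP 503. `send` is a hypothetical zero-argument callable that issues one search request and returns `(status, body)`; the function name, retry count, and delay values are illustrative and not taken from ES or any client library.

```python
import random
import time


def search_with_backoff(send, max_attempts=5, base_delay=0.1, max_delay=5.0,
                        sleep=time.sleep, rng=random.random):
    """Call send() until it returns a non-503 status or attempts run out.

    send is a hypothetical callable returning (status, body); sleep and rng
    are injectable so the retry schedule can be tested without waiting.
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            # Anything other than "rejected, try later" goes back to the caller.
            return status, body
        if attempt == max_attempts - 1:
            # Out of attempts: give up without sleeping again.
            break
        # Exponential back-off with jitter, so that hundreds of clients
        # rejected in the same burst do not all retry at the same instant.
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        sleep(rng() * ceiling)
    return status, body
```

With the defaults, the waits are capped at 0.1s, 0.2s, 0.4s and 0.8s before the caller finally sees the 503 itself.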

Is this the typical approach?

Cheers,
Erik

Clinton_Gormley · July 4, 2012, 10:25am

On Tue, 2012-07-03 at 13:51 -0700, Erik Rose wrote:

    How does everybody deal with timeouts under spiky load?

    I have a 2-node, 5-shard ES 0.18.7 setup with a 45MB corpus. Each node has 70GB RAM and a 19GB heap size for ES. I'm using the mmapfs store.

    Our workload is such that we inundate ES with hundreds of queries over a few seconds. Due to...

    the number of pending TCP connections that build up at the ESes and
    the 15-second GC pauses on such big heaps,

Have you assigned the user that is running elasticsearch the right to
lock all 19GB (ulimit -l), and are you using bootstrap.mlockall?
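
One way to check the first half of that question from the account that runs ES is sketched below. This is a rough check under stated assumptions: `memlock_allows` and `GIB` are hypothetical names, and the 19GB figure is only a lower bound, since mlockall locks the whole process address space rather than just the heap. Note that `resource.getrlimit` reports bytes, while `ulimit -l` typically prints kilobytes.

```python
import resource

GIB = 1024 ** 3


def memlock_allows(required_bytes, soft_limit=None):
    """Return True if the locked-memory soft limit covers required_bytes.

    If soft_limit is None, the current process's RLIMIT_MEMLOCK is read;
    passing it explicitly makes the comparison easy to test.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft_limit == resource.RLIM_INFINITY:
        # "unlimited": any amount may be locked.
        return True
    return soft_limit >= required_bytes
```

Running `memlock_allows(19 * GIB)` as the ES user says whether that user's limit is at least as large as the heap.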

clint

Erik_Rose · July 5, 2012, 4:28pm

    Have you assigned the user that is running elasticsearch the right to
    lock all 19GB (ulimit -l), and are you using bootstrap.mlockall?

Not only that, but swap isn't even enabled on that box.

Clinton_Gormley · July 6, 2012, 11:13am

On Thu, 2012-07-05 at 09:28 -0700, Erik Rose wrote:

        Have you assigned the user that is running elasticsearch the
        right to lock all 19GB (ulimit -l), and are you using
        bootstrap.mlockall?

    Not only that, but swap isn't even enabled on that box.

Then I don't understand why you're seeing 15 second GC pauses. We have
40GB of data in our indices, and two nodes with 36GB total, of which
24GB is assigned to the ES heap.

Before we started using mlockall (or turning off swap), we had frequent
long GC pauses.

Since turning off swap, we have none. It is super fast. Don't know what
to suggest.

clint
