Timeouts under spiky load

How does everybody deal with timeouts under spiky load?

I have a 2-node, 5-shard ES 0.18.7 setup with a 45MB corpus. Each node has 70GB RAM and a 19GB heap size for ES. I'm using the mmapfs store.

Our workload is such that we inundate ES with hundreds of queries over a few seconds. Due to...

  • the number of pending TCP connections that build up at the ES nodes and
  • the 15-second GC pauses on such big heaps,

...a lot of those requests time out. Worse, the large number of pending connections sometimes causes the JVM to become unresponsive, similar to the situation in https://groups.google.com/forum/#!msg/elasticsearch/fxxJG6iSVrM/3fynSV7xPyYJ.
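
One stopgap I've considered on our side is capping how many searches we allow in flight at once, so connections can't pile up behind a node that's paused in GC. Roughly this (a hypothetical sketch, not what we actually run; the limit and endpoint are made up):

import threading

import requests  # assuming we talk to ES over HTTP

# Cap in-flight searches so a GC-paused node doesn't accumulate
# hundreds of pending TCP connections.
MAX_IN_FLIGHT = 20
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def search(query):
    # Block until fewer than MAX_IN_FLIGHT requests are active.
    with _slots:
        return requests.get('http://localhost:9200/_search',
                            params={'q': query},
                            timeout=5)

But that only papers over the problem.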

I can hardly be unique here. What does everyone else do? My current avenue of exploration is messing with the search threadpool configuration, making it use a fixed-size queue:

threadpool:
    search:
        type: fixed
        queue_size: 70
        reject_policy: caller

I'm a little fuzzy on what the "caller" reject_policy does, but "abort" would at least return an HTTP 503, which I could catch in my app to trigger a back-off.
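
On the app side, I'm picturing something like this (just a sketch; the retry count and delays are invented):

import time

import requests  # assuming HTTP access to ES

def search_with_backoff(query, retries=4, base_delay=0.5):
    # Retry when ES sheds load (HTTP 503), backing off
    # exponentially: 0.5s, 1s, 2s, 4s.
    for attempt in range(retries):
        resp = requests.get('http://localhost:9200/_search',
                            params={'q': query},
                            timeout=5)
        if resp.status_code != 503:
            return resp
        time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError('ES still overloaded after %d tries' % retries)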

Is this the typical approach?

Cheers,
Erik

On Tue, 2012-07-03 at 13:51 -0700, Erik Rose wrote:

    How does everybody deal with timeouts under spiky load?

    I have a 2-node, 5-shard ES 0.18.7 setup with a 45MB corpus. Each node has 70GB RAM and a 19GB heap size for ES. I'm using the mmapfs store.

    Our workload is such that we inundate ES with hundreds of queries over a few seconds. Due to...

      • the number of pending TCP connections that build up at the ES nodes and
      • the 15-second GC pauses on such big heaps,

Have you assigned the user that is running elasticsearch the right to
lock all 19GB (ulimit -l), and are you using bootstrap.mlockall?
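
That is, something along these lines (the user name is just an example; adjust for your install). In elasticsearch.yml:

bootstrap.mlockall: true

...and in /etc/security/limits.conf, assuming ES runs as a user named "elasticsearch":

elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited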

clint

    Have you assigned the user that is running elasticsearch the right to
    lock all 19GB (ulimit -l), and are you using bootstrap.mlockall?

Not only that, but swap isn't even enabled on that box.

On Thu, 2012-07-05 at 09:28 -0700, Erik Rose wrote:

        Have you assigned the user that is running elasticsearch the
        right to lock all 19GB (ulimit -l), and are you using
        bootstrap.mlockall?

    Not only that, but swap isn't even enabled on that box.

Then I don't understand why you're seeing 15-second GC pauses. We have
40GB of data in our indices, and two nodes with 36GB of RAM in total, of
which 24GB is assigned to the ES heap.

Before we started using mlockall (or turned off swap), we had frequent
long GC pauses.

Since turning off swap, we've had none; it's super fast. I don't know
what else to suggest.

clint