How does everybody deal with timeouts under spiky load?
I have a 2-node, 5-shard ES 0.18.7 setup with a 45MB corpus. Each node has 70GB RAM and a 19GB heap size for ES. I'm using the mmapfs store.
Our workload is such that we inundate ES with hundreds of queries over a few seconds. Due to...
- the number of pending TCP connections that build up at the ES nodes and
- the 15-second GC pauses on such big heaps,
...a lot of those requests time out. Worse, the large number of pending connections sometimes causes the JVM to become unresponsive, similar to the situation in https://groups.google.com/forum/#!msg/elasticsearch/fxxJG6iSVrM/3fynSV7xPyYJ.
I can hardly be unique here. What does everyone else do? My current avenue of exploration is messing with the search threadpool configuration, making it use a fixed-size queue:
I'm a little fuzzy on what the "caller" reject_policy does, but "abort" would at least return an HTTP 503, which I could catch in my app to trigger a back-off.
Is this the typical approach?
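For concreteness, the app-side back-off I have in mind would look roughly like the sketch below. It is a minimal illustration, not tested against a real cluster: `send` is a hypothetical callable standing in for whatever HTTP client call issues the search and returns `(status_code, body)`, the delay values are arbitrary, and it assumes (as above) that a rejected search really does surface as an HTTP 503.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Raised when every attempt came back as HTTP 503."""


def search_with_backoff(send, max_attempts=5, base_delay=0.1, max_delay=5.0,
                        sleep=time.sleep, rng=random.random):
    """Call send() until it returns a non-503 status or attempts run out.

    send is a hypothetical callable returning (status_code, body); it is a
    placeholder for the real client call, not part of any ES API.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential back-off with jitter: wait a random fraction of
            # base_delay * 2^attempt, capped at max_delay, so that a burst of
            # rejected clients does not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
    raise ServiceUnavailable("gave up after %d attempts" % max_attempts)
```

The jitter matters for a workload like ours: without it, hundreds of queries rejected in the same instant would all come back at the same instant and recreate the spike.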