This week I had a severe issue with ElasticSearch in a production
environment related to the particulars of the setup I had. I thought
I'd share it in case anyone else had a similar setup, and especially
if someone had found a good way to solve the problem.

The way I have ElasticSearch set up is:
- I have a 3 node cluster being queried by ~200 machines
- Each of those machines is a dumb webserver
- Each dumb webserver pre-loads a list of available ElasticSearch
  nodes on startup before forking.
- Each forked child lives for a fairly short time (i.e. only a few
  hundred requests), and since it's doing mixed traffic it's likely
  that it'll only do 1-2 ElasticSearch queries.

The failure that I had was that the switch pointing to one of the
three ES nodes went down, so each newly forked server child had a 1/3
chance of contacting that machine for its first request and running
into its HTTP connection timeout before moving on to the next node.

Thus effectively 1/3 of the requests to ElasticSearch would have to
wait for $HTTP_TIMEOUT seconds before marking that node as bad and
proceeding to the next one, and since each child lives for so few
requests, the built-in safety valve in the client library of not
retrying queries against known-bad nodes effectively did nothing.
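
To put rough numbers on that, here is a small back-of-the-envelope
sketch in Python. It is not taken from any client library; it assumes
that exactly one node is dead, that each query picks uniformly at
random among the nodes the child doesn't yet know to be bad, and that
the child's knowledge is lost when it exits. The function name
timeout_fraction is made up for illustration.

```python
from fractions import Fraction


def timeout_fraction(queries_per_child, nodes=3):
    """Fraction of queries that wait for the HTTP timeout.

    Assumes exactly one of `nodes` is dead and each query picks
    uniformly at random among nodes not yet known to be bad.
    """
    p_dead = Fraction(1, nodes)
    # A child pays the timeout at most once: on the first query that
    # happens to pick the dead node. After that it avoids the node.
    p_never_hit = (1 - p_dead) ** queries_per_child
    expected_timeouts = 1 - p_never_hit
    return expected_timeouts / queries_per_child


for q in (1, 2, 100):
    print(q, float(timeout_fraction(q)))
```

Under these assumptions a child that does one query stalls 1/3 of the
time, and a child that does two stalls on 5/18 of its queries, which
is barely better. Only a process that lives for many queries gets to
amortize the one timeout it pays.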

I'm currently pondering a few solutions to this, each with its own
pros and cons:
- Patch the client library to share the state of what nodes are
  good/bad between all processes on the system, e.g. using shared
  memory, some dumb file storage etc. This would be relatively easy
  and I could feed the changes back to the client library.
- Stick a load balancer in front of the ES boxes. I'd done this
  previously and it caused some problems due to the LB only
  understanding "connection refused" as a failure mode. I.e. it
  didn't understand that it should try again if a node was starting
  up and replying with "go away, I'm initializing".

  That could be solved by a smarter LB, or having the clients retry
  N times on the LB in case of failure, hoping that they'll get an
  OK node on the next request.
- Stick a long-living intermediary between the dumb machines and the
  ES servers, i.e. have search requests served by an API that uses
  the ES client library.
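
For the first option, the "dumb file storage" variant could look
roughly like the Python sketch below. This is only an illustration of
the idea, not how any existing client library works: the names
STATE_DIR, BAD_TTL, mark_bad and usable_nodes are all hypothetical,
and the 30 second retry window is an arbitrary choice. Each bad node
gets a marker file whose mtime records when it was last seen failing,
so every process on the same machine can skip it without paying the
timeout itself.

```python
import hashlib
import os
import tempfile
import time

# Hypothetical location and retry window; adjust to taste.
STATE_DIR = os.path.join(tempfile.gettempdir(), "es-bad-nodes")
BAD_TTL = 30  # seconds before a bad node is tried again


def _marker(node, state_dir):
    # Hash the node name so "host:port" is always a safe file name.
    digest = hashlib.sha1(node.encode("utf-8")).hexdigest()
    return os.path.join(state_dir, digest)


def mark_bad(node, state_dir=STATE_DIR):
    """Record that `node` just failed; rewriting refreshes the mtime."""
    os.makedirs(state_dir, exist_ok=True)
    with open(_marker(node, state_dir), "w") as fh:
        fh.write(node)


def usable_nodes(nodes, state_dir=STATE_DIR, ttl=BAD_TTL, now=None):
    """Return the nodes not marked bad within the last `ttl` seconds."""
    now = time.time() if now is None else now
    good = []
    for node in nodes:
        try:
            age = now - os.stat(_marker(node, state_dir)).st_mtime
        except FileNotFoundError:
            good.append(node)  # never marked bad
            continue
        if age > ttl:
            good.append(node)  # marker expired, give it another try
    # If every node is marked bad, fall back to trying them all.
    return good or list(nodes)
```

A freshly forked child would call usable_nodes() on its pre-loaded
node list before its first query and mark_bad() whenever a connection
times out. With that, only the first child to hit the dead node in
each retry window waits for $HTTP_TIMEOUT. The state is per machine,
so each of the webservers still discovers the failure on its own.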