Networking issues + dumb ES clients = FAIL

This week I had a severe issue with ElasticSearch in a production
environment related to the particulars of the setup I had. I thought
I'd share it in case anyone else had a similar setup, and especially
if someone had found a good way to solve the problem.

The way I have ElasticSearch set up is:

  • I have a 3 node cluster being queried by ~200 machines
  • Each of those machines is a dumb webserver
  • Each dumb webserver pre-loads a list of available ElasticSearch
    nodes on startup before forking.
  • Each forked child lives for a fairly short time (i.e. only a few
    hundred requests), and since it's doing mixed traffic it's likely
    that it'll only do 1-2 ElasticSearch queries.

The failure that I had was that the switch in front of 1 of the 3 ES
nodes went down, so newly forked server children had a 1/3 chance of
contacting that machine for their first request and running into their
HTTP connection timeout before moving on to the next one.

Thus effectively 1/3 of requests to ElasticSearch would have to wait
for $HTTP_TIMEOUT seconds before marking that node as bad and
proceeding to the next one, and since each child lives for so few
requests the built-in safety valve in the client library of not
retrying queries against known-bad nodes effectively did nothing.

I'm currently pondering a few solutions to this which each have their
own pros and cons.

  1. Patch the client library to share the state of what nodes are
    good/bad between all processes on the system, e.g. using shared
    memory, some dumb file storage etc. This would be relatively easy
    and I could feed the changes back to the client library.

  2. Stick a load balancer in front of the ES boxes. I'd done this
    previously and it caused some problems due to the LB only
    understanding "connection refused" as a failure mode. I.e. it
    didn't understand that it should try again if a node was starting
    up and replying with "go away, I'm initializing".

    That could be solved by a smarter LB, or by having the clients
    retry N times against the LB on failure, hoping that they'll get an
    OK node on the next attempt (there's a rough sketch of such a retry
    wrapper after this list).

  3. Stick a long-living intermediary between the dumb machines and the
    ES servers, i.e. have search requests served by an API that uses
    the ES client library.
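
For the client-side retry in option 2, something as dumb as this
untested wrapper would probably do; Try::Tiny and the ->search call are
just stand-ins for however the real code issues the query:

    use strict;
    use warnings;
    use Try::Tiny;

    # Retry a search a few times against the LB so that a request which
    # lands on a bad node gets another chance on a (hopefully) good one.
    sub search_with_retry {
        my ($es, $tries, %args) = @_;
        my $last_error;
        for (1 .. $tries) {
            my $result = try { $es->search(%args) }
                         catch { $last_error = $_; undef };
            return $result if defined $result;
        }
        die "all $tries attempts against the LB failed: $last_error";
    }

    # e.g.: my $res = search_with_retry($es, 3, index => 'foo', query => { ... });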

Is option 4 to use a 'real' ES client that knows about cluster state? You
didn't say what your webservers were running, but if it's Java you can
join the ES cluster in client mode (see the Java API docs on client
nodes) and then you get all of Elasticsearch's clustering for free,
including adding/removing nodes when they come up / go bad.

--

Paul Loy
paul@keteracel.com
http://uk.linkedin.com/in/paulloy

On Sun, Nov 27, 2011 at 01:30, Paul Loy keteracel@gmail.com wrote:

Is option 4 to use a 'real' ES client that knows about cluster state? You
didn't say what your webservers were running, but if it's Java you can
join the ES cluster in client mode (see the Java API docs on client
nodes) and then you get all of Elasticsearch's clustering for free,
including adding/removing nodes when they come up / go bad.

The webservers are running Perl and using Clinton Gormley's
Elasticsearch bindings (https://metacpan.org/module/ElasticSearch).

Due to the "we're forking dumb children all the time" architecture
it's not a viable option to have the webservers join the ES cluster as
clients.

Besides, it seems like pretty serious overkill to have all the
webservers in the ES cluster all the time, receiving status updates,
just to handle the case of a node occasionally being bad.

E.g. one simple way to implement #1 would be to have a
/tmp/elastic-search-state directory containing these files:

  • servers: a newline-separated list of servers we have
  • failed-servers: servers that we know have failed
  • num-requests: how many requests have we made?

When a given child process made a request, it would do:

echo -n . >>/tmp/elastic-search-state/num-requests

So you could get the total number of requests we've made so far on
this box with:

du -b /tmp/elastic-search-state/num-requests

To get a list of OK servers you'd do:

cat /tmp/elastic-search-state/{servers,failed-servers} |
    sort | uniq -c | grep "^ *1 " | awk '{print $2}'

And if any child process exceeded the max_requests (as retrieved with
"du -b num-requests") it would clobber the "servers" file and
"failed-servers" if needed.

You could check with stat(2) whether you needed to re-read the files,
and you wouldn't have to deal with locking the files for update, since
it would be just fine for two children to redundantly fetch the server
list.
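
A rough, untested sketch of the child-side helpers for that; the file
names are the ones above, everything else (the helper names, using -s
instead of shelling out to du) is just illustration:

    use strict;
    use warnings;

    my $dir = '/tmp/elastic-search-state';

    sub read_lines {
        my ($file) = @_;
        open my $fh, '<', $file or return ();   # missing file == empty list
        chomp(my @lines = <$fh>);
        return @lines;
    }

    # Servers that are in "servers" but not in "failed-servers".
    sub ok_servers {
        my %failed = map { $_ => 1 } read_lines("$dir/failed-servers");
        return grep { !$failed{$_} } read_lines("$dir/servers");
    }

    # Append one byte per request; -s gives the same count "du -b" would.
    sub note_request {
        open my $fh, '>>', "$dir/num-requests" or die "open num-requests: $!";
        print {$fh} '.';
    }

    sub total_requests { return -s "$dir/num-requests" || 0 }

    sub mark_failed {
        my ($server) = @_;
        open my $fh, '>>', "$dir/failed-servers" or die "open failed-servers: $!";
        print {$fh} "$server\n";
    }

The clobbering and stat(2) re-read logic would sit on top of that.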

On Sat, 2011-11-26 at 16:30 -0800, Paul Loy wrote:

Is option 4 to use a 'real' ES client that knows about cluster state?
You didn't say what your webservers were running, but if it's Java
you can join the ES cluster in client mode (see the Java API docs on
client nodes) and then you get all of Elasticsearch's clustering for
free, including adding/removing nodes when they come up / go bad.

Hi Paul

I don't know much about how the Java client works, but from what I've
seen on the list, I was under the impression that it can take a second
or two to join the cluster. This is fine for long lived processes, but
would introduce too much latency to clients that only do 2-3 requests
before exiting.

The Perl API, by default, while it doesn't join the cluster, does try to
sniff the live nodes in the cluster by sending a nodes request to each
of a list of preconfigured nodes until it gets a successful response.

In this case, where the bad switch was causing 1 node to hang until it
timed out, this still wouldn't have been very useful.

Although, if Avar were to use one of the available async backends
(AnyEvent::HTTP or AnyEvent::Curl) then these sniff requests would be
sent in parallel, and the first successful response would be used.
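
Roughly, that's just a matter of picking one of the AnyEvent transports
when constructing the client. Untested, and the parameter names are
from memory, so check the module docs before copying this:

    use ElasticSearch;

    # 'aehttp' = AnyEvent::HTTP backend ('aecurl' for AnyEvent::Curl); with an
    # async transport the sniff requests to the configured nodes go out in
    # parallel and the first good answer wins.
    my $es = ElasticSearch->new(
        servers   => [ 'es1:9200', 'es2:9200', 'es3:9200' ],
        transport => 'aehttp',
        timeout   => 2,    # fail a dead node quickly instead of the default timeout
    );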

  1. Patch the client library to share the state of what nodes are
    good/bad between all processes on the system, e.g. using shared
    memory, some dumb file storage etc. This would be relatively easy
    and I could feed the changes back to the client library.

One easy possibility would be to store this data in memcached, which is
frequently already being used in web apps, is fast and won't lock.
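
Something along these lines, untested, with the key names and the
60-second expiry picked arbitrarily:

    use strict;
    use warnings;
    use Cache::Memcached;

    my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

    # Flag a node as bad for 60 seconds; the expiry means it gets retried
    # automatically once the window has passed.
    sub mark_node_bad {
        my ($node) = @_;
        $memd->set("es:bad:$node", 1, 60);
    }

    sub node_is_ok {
        my ($node) = @_;
        return !$memd->get("es:bad:$node");
    }

    # e.g. filter the preconfigured list before picking a node to query:
    # my @candidates = grep { node_is_ok($_) } @configured_nodes;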

clint

Why not just try and move to long-lived "client" processes? Seems like an
architecture where client processes are spawned for 2-3 requests is
problematic not just connection-wise (elasticsearch or other - db,
memcached), but also resource-wise.

On Mon, Nov 28, 2011 at 12:31, Shay Banon kimchy@gmail.com wrote:

Why not just try and move to long lived "client" processes? Seems like an
architecture where client processes are spawned for 2-3 requests is
problematic not just connection wise (elasticsearch or other - db,
memcached), but also resource wise.

The processes live for more than 2-3 requests in total; they do
around 150 requests on average, but out of those maybe only 2-3 are
search requests.

The reason it's set up like this is that fork(2) is cheap, and
depending on your design, incrementally cleaning up memory or doing
garbage collection can be a lot more expensive than just killing the
process, having its memory freed by the OS, and replacing it with a
new process forked from your ready-to-be-used parent process.

MySQL deals with this use case well because it maintains a connection
pool, so if you reap a child and another fresh process connects MySQL
doesn't incur any noticeable overhead in setting up the connection,
unlike some other databases.

Not sure I get it. Are you saying that you create a pool of connections in
the parent process, and then pass a connection to the child process
(sharing FD?) from the pool, and when that child process is dead, return
the connection to the pool? If so, possibly you can do it with the http
connection as well?

Still don't buy the fork vs. long-running processes reasoning, but
that's not the platform to have this discussion.
