Fault-tolerant elasticsearch (JVM heap OOM)

Dear elasticsearch users

I am running a PHP web application whose data layer is backed by 3
elasticsearch nodes.

Once in a while an individual node may fail (e.g. recently one ran into a
JVM heap OOM), but the cluster still turns green again (only 2 nodes are
required), so I would like to make the application fault tolerant.

What is the best practice to avoid sending requests to a failing
instance?

Would you implement health checks at the application layer?

Any examples or advice would be much appreciated

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

What exactly are you doing? Are you indexing documents? Are you
searching for documents? Do you keep your queries in a logfile so you
can trace what is going on? Did you enable GC-level logging in
Elasticsearch? Do you have a strategy for sizing your application, that is,
have you calculated in advance how many resources you will need?

Jörg

On 02.04.13 18:03, foufos wrote:

We've considered using haproxy to load-balance (round-robin) the REST calls
across the different nodes. haproxy can easily do a health check, either by
pinging the machine or by sending HTTP requests and checking the response.
In the end we went with an off-the-shelf load balancer from our hosting
company. This seems to work just fine, but we haven't tested it ourselves.
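A minimal haproxy configuration along those lines might look like this (a sketch only; the node addresses are placeholders, and the check simply polls each node's HTTP root):

```
listen elasticsearch
    bind *:9200
    mode http
    balance roundrobin
    # Poll each backend over HTTP; a failing check takes the node
    # out of rotation until it recovers.
    option httpchk GET /
    server es1 10.0.0.1:9200 check
    server es2 10.0.0.2:9200 check
    server es3 10.0.0.3:9200 check
```

haproxy then stops sending traffic to a node as soon as its health check fails and puts it back once the check succeeds again.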

Jaap

Jaap Taal

[ Q42 BV | tel 070 44523 42 | direct 070 44523 65 | http://q42.nl |
Waldorpstraat 17F, Den Haag | Vijzelstraat 72 unit 4.23, Amsterdam | KvK
30164662 ]

On Tue, Apr 2, 2013 at 6:07 PM, Jörg Prante joergprante@gmail.com wrote:

Hey

another solution might be to run a client node on your web application
server. This is an elasticsearch node which does not hold any data and is
not allowed to become master, but still knows the cluster's internal
structure and which nodes can be queried (and a little bit more). There is
a comment about that configuration in the default elasticsearch.yml as
well (which is not the best place for it, obviously).
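The relevant settings for such a client node would be roughly this (a sketch; check the commented examples in your distribution's elasticsearch.yml):

```yaml
# A "client" node: joins the cluster and routes requests to the right
# shards, but holds no data and is never elected master.
node.data: false
node.master: false
```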

--Alex

On Tue, Apr 2, 2013 at 6:03 PM, foufos foufos7@gmail.com wrote:

@Jörg What are the most common causes of OOM in ES?

@foufos I know that if you use the Java client and add all node addresses
to the TransportClient, it will manage this for you. Otherwise you can
just keep a list of servers to try your request against if you don't want
a load balancer (I assume those would be the only ways if you opt for
using the REST API over HTTP)
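That server-list fallback could be sketched like this (Python rather than PHP for brevity; the host names and the default check are made up for illustration):

```python
import urllib.request

def first_healthy_node(hosts, is_up=None):
    """Return the first host whose health check passes, or None if all fail."""
    if is_up is None:
        def is_up(host):
            # Default check: ping the node's HTTP root endpoint.
            try:
                return urllib.request.urlopen(
                    "http://%s/" % host, timeout=2).getcode() == 200
            except OSError:
                return False
    for host in hosts:
        if is_up(host):
            return host
    return None

# Example: pretend es1 is down, so requests would go to es2 instead.
nodes = ["es1:9200", "es2:9200", "es3:9200"]
print(first_healthy_node(nodes, is_up=lambda h: h != "es1:9200"))
```

The application would send its request to the returned host and only fall through to the next one on failure.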

On Thursday, April 4, 2013 10:33:33 AM UTC+4, Alexander Reelsen wrote:


OOM happens when the heap size is not sufficient.

In ES, you have to consider which workloads require heap space:

  • for large segment merging. The bigger the index grows, the more heap
    is required for segment merging
  • for large documents and large bulks while indexing
  • for large result sets
  • and for field caching for filtering and faceting

Finding a reasonable heap size requires some testing under different
workloads. There is no general rule for a "correct" heap size.

You can tackle OOM by scaling out (adding more nodes), scaling up
(adding more RAM per node), or streamlining resource consumption over
the lifecycle of the ES process (smaller segments while merging, smaller
bulk requests, smaller query results, avoiding "bad" queries with heavy
resource consumption)
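To watch how close each node gets to its heap limit, you can poll the node stats API (a sketch; the `/_nodes/stats/jvm` path and `heap_used_percent` field come from the node-stats API, but verify them against your ES version; the host is a placeholder):

```python
import json
import urllib.request

def heap_usage(stats):
    """Map node name -> JVM heap used percent from a node-stats response."""
    return {node.get("name", node_id): node["jvm"]["mem"]["heap_used_percent"]
            for node_id, node in stats["nodes"].items()}

# Live usage would look like (host is a placeholder):
#   resp = urllib.request.urlopen("http://localhost:9200/_nodes/stats/jvm")
#   print(heap_usage(json.load(resp)))

# Trimmed sample of the response shape:
sample = {"nodes": {"abc123": {"name": "node-1",
                               "jvm": {"mem": {"heap_used_percent": 73}}}}}
print(heap_usage(sample))
```

Alerting when the percentage stays high gives you warning before the node actually OOMs.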

Jörg

On 04.04.2013 11:37, Mo wrote:

@Jörg What are the most common causes of OOM in ES?


After investigating, it turns out we had a lot of exceptions due to wrong
mapping attributes.
Since fixing this, we haven't experienced a similar issue.
Can this cause OOM?

@Jörg We have about 850K documents, fairly small in size, and we also have
routing set up to reduce overhead.

So the problem is not currently being triggered, but we have to create a
fallback in case another server becomes unresponsive and we get another
split-brain scenario.

So now we are considering implementing a solution along the lines @Alex
suggested.
Do you think that by doing something like that we can avoid a split-brain
scenario?
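(From what I've read, the usual guard against split brain is the
discovery.zen.minimum_master_nodes setting; with 3 master-eligible nodes,
requiring a majority of 2 in elasticsearch.yml would look like this:)

```yaml
# With 3 master-eligible nodes, require a majority (2) before a master
# can be elected, so an isolated node cannot form its own cluster.
discovery.zen.minimum_master_nodes: 2
```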

thank you
foufos

On Thursday, 4 April 2013 15:28:03 UTC+3, Jörg Prante wrote:
