Question on timeout on the client side and OOM error on the cluster


(Jae) #1
  1. ES version: 0.90.10

  2. What happened:

We measured our peak traffic, provisioned more EC2 m1.xlarge instances than
that peak should need, and got the cluster ready to take traffic.
Immediately after we turned the traffic on, the whole ES cluster went down
with an OOM error. I analyzed a heap dump and 6.5GB of it was taken up by
TransportService, which means the ES server instances were backed up with
unhandled requests from clients.

  3. Client's behavior

There are 500 threads sending bulk requests to the ES cluster, each with a
2-second timeout. I thought a 2-second timeout would be reasonable, but when
I checked the rx/tx graphs they showed 38GB per second, which is an
unbelievable number (see the graphs). Does this mean we shouldn't use a
timeout with a large cluster?

Thank you
Best, Jae

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/368b6a6e-8678-4de6-a7da-ef950c8f2bc8%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Alexander Reelsen) #2

Hey

maybe you should reduce your parallel bulk indexing. Having 500 parallel
bulk requests means you need to reserve a lot of memory for parsing those
requests and keeping some state in memory until all of the requests are
processed and returned. Because a single bulk request needs to send lots
of data around in the cluster, I assume this is the reason for the many
TransportService calls.
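
To put a rough number on the memory point above, here is a back-of-envelope
sketch in Python. Only the 500 concurrent requests comes from the original
post. The 5 MB bulk payload and the copy factor of 2 are purely illustrative
assumptions, not figures from this thread or from Elasticsearch, and the
result is a total across the cluster rather than per node.

```python
MB = 1024 * 1024
GB = 1024 * MB


def inflight_bulk_bytes(concurrent_requests, bulk_payload_bytes, copies_per_request=1):
    """Rough estimate of memory held by bulk requests that are in flight at once.

    copies_per_request is a hypothetical fudge factor covering the raw payload
    plus whatever parsed or per-shard copies are buffered while the request
    is being processed.
    """
    return concurrent_requests * bulk_payload_bytes * copies_per_request


# 500 threads is from the original post; 5 MB and 2 copies are assumptions.
estimate = inflight_bulk_bytes(500, 5 * MB, copies_per_request=2)
print(f"{estimate / GB:.1f} GB held by in-flight bulk requests")
```

The exact figures matter less than the shape of the formula: the memory
needed scales linearly with the number of parallel bulk requests, so cutting
the parallelism cuts the memory by the same factor.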

For bulk indexing, start small and serialized, then go higher step by step
if you need to (keep in mind that a single bulk request already means you
are executing lots of operations in parallel).
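
One way to "start small and then go higher" is to cap the number of bulk
requests in flight on the client and raise the cap gradually. The following
is a minimal Python sketch of that idea, not an Elasticsearch client API:
`BoundedBulkSender`, `run_batches`, and the `send_bulk` callable are
hypothetical names, and `send_bulk` stands in for whatever client call
actually performs the bulk request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedBulkSender:
    """Caps the number of bulk requests that are in flight at the same time."""

    def __init__(self, send_bulk, max_in_flight=1):
        # send_bulk is a hypothetical callable (batch -> response) wrapping
        # the real bulk call of whichever client library is in use.
        self._send_bulk = send_bulk
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def send(self, batch):
        # Callers block here until a slot frees up, so any backlog builds on
        # the client side rather than as queued requests on the cluster.
        with self._slots:
            return self._send_bulk(batch)


def run_batches(send_bulk, batches, max_in_flight, worker_threads=8):
    """Send all batches from a thread pool while respecting the in-flight cap."""
    sender = BoundedBulkSender(send_bulk, max_in_flight)
    with ThreadPoolExecutor(max_workers=worker_threads) as pool:
        return list(pool.map(sender.send, batches))


if __name__ == "__main__":
    def fake_send_bulk(batch):
        # Stand-in for a real bulk call; sleeps to simulate server work.
        time.sleep(0.01)
        return len(batch)

    batches = [["doc"] * 100 for _ in range(20)]
    # Start serialized, then raise the cap step by step while watching the cluster.
    for cap in (1, 2, 4):
        results = run_batches(fake_send_bulk, batches, max_in_flight=cap)
        print(cap, sum(results))
```

With a cap of 1 the requests are fully serialized, no matter how many worker
threads exist. The cap can then be raised one step at a time while watching
heap usage and indexing throughput on the cluster.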

--Alex

On Fri, Feb 14, 2014 at 8:05 PM, Jae metacret@gmail.com wrote:




(system) #3