Cluster crash, symptoms and possible explanation

Hi folks,

our cluster "crashed" last night.
We have a couple of symptoms and are trying to narrow down the problem.
Our setup: 3 nodes in AWS, ES version 19.11, 4 indices for different
services.
The master was node1. The CPU load of this node suddenly rose to 100% from
3:00 to 3:30.
The CPU load on the other nodes was low, and the logs are empty. It was only
the CPU load; memory consumption, network, etc., everything else was normal.
Services that tried to connect to their indices timed out after one
minute with no response.

What happened here? Could a "slow" query from only one service be the
trigger for this? And what about the other nodes in the cluster: why did
they not return any results for the other services, from the indices which
were still working (on nodes 2 & 3)?

A full cluster restart was the only solution.
But how can we prevent this case (one node down, and the whole cluster
stops answering)?

Cheers,
Vadim

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Vadim,

Can you say a bit more about your cluster setup?

  1. How many primary shards did you have per index? How many replicas?
  2. Was the node that experienced the high CPU load also the cluster master
    at the time? (You can see in the logs which node was elected master.)

Cheers,
Boaz

On Tuesday, July 9, 2013 9:44:07 AM UTC+2, Vadim Kisselmann wrote:

Hi Boaz,

thanks for your reply.

  1. It's the default setting, 3 nodes: 5 shards x 1 replica per index.
  2. It was the master (high CPU load, and only the CPU; RAM, HDD I/O,
    network, everything else was fine).
    After investigating everything, like the Tomcat logs from my
    services (connections, errors), settings and so on, I found nothing
    suspicious. Everything is like in the past months.
    I have only one idea: ES has a bug in this old version (19.11) and
    something caused an endless loop, because only the CPU load was at 100% on
    all 8 cores, but nothing else on this machine.

Cheers,
Vadim

On Tuesday, July 9, 2013 3:35:54 PM UTC+2, Boaz Leskes wrote:

Hi Vadim,

I don't know of any bug that causes such symptoms, but you never know. It
may also be other things like scripts etc. Next time it happens (if it does;
I understand it's rare), calling the hot threads API would really help in
diagnosing it (
http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-hot-threads/)
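
For reference, a minimal sketch of such a call, assuming a node reachable
on localhost:9200 (parameter names per the API docs linked above):

```shell
# Samples the hottest threads on every node; "threads" controls how many
# threads are reported and "interval" how long each sampling window is.
curl -s "http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms"
```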

Cheers,
Boaz

On Wed, Jul 10, 2013 at 11:50 AM, Vadim Kisselmann v.kisselmann@gmail.com wrote:

Hi Boaz,

we had a bigger crash last weekend, and now we have problems rebalancing
our cluster. Suspiciously, again at 3am.
The master is fully loaded at 100% CPU, and this seems to block disk and
network on AWS, because the other nodes don't replicate anything. You can
see with atop that disk reads/writes on the master are at 0; MBr/s is between
0-1MB.
Hot threads on the master are busy with over 100% CPU load. It's weird.
https://gist.github.com/vkisselmann/5998537

Cheers,
Vadim

On Thursday, July 11, 2013 9:37:57 PM UTC+2, Boaz Leskes wrote:

Hi Vadim,
Do you have any disk activity?
We had similar cases in AWS where nodes would peg the CPU and usually get
I/O-bound too.
I changed the cluster so that we have 3 master nodes, with no data stored
on them, on smaller instances. Then your data nodes, all of them obviously
configured with master=false. The app servers speak to the masters only,
via load balancers. This smoothed out the crazy spikes, all nodes are
loaded pretty evenly, and we haven't seen a case where a node gets locked
up, or worse, a split brain.
As always, YMMV.
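
A sketch of that split in 0.19-era settings (the config path is an
assumption for your install; the node.* flags go in each node's
elasticsearch.yml):

```shell
# On the three small dedicated-master instances:
cat <<'EOF' >> /etc/elasticsearch/elasticsearch.yml
node.master: true
node.data: false
EOF
# On every data instance the flags are inverted (node.master: false,
# node.data: true). With three master-eligible nodes, setting
# discovery.zen.minimum_master_nodes: 2 also guards against split brain.
```
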
On 15/07/2013 7:09 PM, "Vadim Kisselmann" v.kisselmann@gmail.com wrote:

Hi Vadim,

Norberto's suggestion will help to keep the cluster stable, even in the
case when one of the nodes becomes overloaded. What seems to be happening
is that your master node is under enough pressure to stop fulfilling its
role correctly. I looked at your gist and I have a couple of questions:

  1. At least one thread is very busy parsing a search request JSON; does
    this make sense to you? Do you have huge queries?
  2. The node is also busy recovering shards. This may be a symptom rather
    than the cause, but can you set the indices.recovery log to debug level?
  3. What type of nodes are you running? How much memory is ES using?
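
For item 2, a sketch of what that logger change might look like (the logger
name and file path are assumptions for a 0.19-era install; restart the node
after editing):

```shell
# In config/logging.yml, the "logger:" section maps logger names to levels.
# Merge this entry into the existing section (don't append a second block):
#
#   logger:
#     indices.recovery: DEBUG
#
# Quick sanity check that the entry landed:
grep -A 10 '^logger:' /etc/elasticsearch/logging.yml
```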

Cheers,
Boaz

On Mon, Jul 15, 2013 at 11:35 AM, Norberto Meijome numard@gmail.com wrote:

On 15/07/2013 9:14 PM, "Boaz Leskes" b.leskes@gmail.com wrote:

Hi Vadim,

Norberto's suggestion will help to keep the cluster stable, even in the
case when one of the nodes becomes overloaded. What seems to be happening
is that your master node is under enough pressure to stop fulfilling its
role correctly.

That definitely seems to be the case.

B

Hi guys,
thanks for your replies!
Norberto's solution sounds interesting. It's a good way to keep the cluster
stable.
And yes, it's definitely the case that the master stops fulfilling its
work. This node was fully blocked.

@Norberto
We had no disk activity at that moment, only the high CPU usage on the master.

@Boaz

  1. Yes, we have bigger facet queries, about 1000 lines in pretty-printed
    JSON.
  2. We can try to set indices.recovery to debug level, but we have to wait
    for the next crash.
  3. We have a split hardware setup now on AWS: 2x c1.xlarge instances
    (7GB RAM) and one with 15GB RAM (m1.xlarge). The last one is currently
    the master. The full index is not really big (30GB).

We restarted both c1.xlarge instances this morning, because both ran out
of memory while trying to rebalance the cluster (this caused the whole
cluster to stop processing queries anymore(!), although the master had no
problems with RAM and, at that time, no CPU problems either):

[2013-07-15 09:45:23,538][INFO ][monitor.jvm] [esearch.cloud]
[gc][ConcurrentMarkSweep][640][163] duration [5.3s], collections
[1]/[5.5s], total [5.3s]/[8.5m], memory [3.8gb]->[3.8gb]/[3.9gb], all_pools
{[Code Cache] [6.4mb]->[6.4mb]/[48mb]}{[Par Eden Space]
[532.5mb]->[532.5mb]/[532.5mb]}{[Par Survivor Space]
[9.5mb]->[6.2mb]/[66.5mb]}{[CMS Old Gen] [3.3gb]->[3.3gb]/[3.3gb]}{[CMS
Perm Gen] [35.3mb]->[35.3mb]/[82mb]}
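
As a side note on that GC line: memory [3.8gb]->[3.8gb]/[3.9gb] and
[CMS Old Gen] [3.3gb]->[3.3gb]/[3.3gb] mean the old generation stayed full
across the collection, i.e. the heap was effectively exhausted, which
matches the nodes running out of memory. A hedged sketch of raising the
heap before restart (env var per the era's stock startup wrapper, which
sets -Xms/-Xmx from it; path and size are placeholders for your install):

```shell
# Stay well below the instance's physical RAM (c1.xlarge has 7GB here).
export ES_HEAP_SIZE=4g
/usr/share/elasticsearch/bin/elasticsearch
```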

See the attached stack trace from one c1.xlarge instance from this time.

The problem in my first mail (100% CPU) seems to be a different problem,
but it's the trigger for our slave nodes to go crazy, so in the end the
whole cluster becomes unresponsive to requests.

Vadim

On Monday, July 15, 2013 1:36:12 PM UTC+2, Norberto Meijome wrote:

FWIW, the nodes you use as masters/balancers use little memory and CPU
compared to the data instances. There are almost no I/O constraints either
(in some cases, we even merge the LB and the ES master nodes with no problem).

On Tue, Jul 16, 2013 at 12:39 AM, Vadim Kisselmann v.kisselmann@gmail.com wrote:

--
Norberto 'Beto' Meijome

Hi Norberto,
thanks for the info :)

I attached a New Relic screenshot for one node (not the master) which hung
this morning.
It's weird behavior: only one core (we have 8 on this machine) was pegged
today, with one "search" thread at 100%.
In the screenshot you can see this between 9:00-9:30 (the straight line
instead of peaks). During this time the whole cluster was not reachable for
our services.
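
A hedged sketch of capturing what that single pegged thread is doing with
plain JDK tooling (pids and thread ids below are placeholders; the hot
threads API would show a similar stack):

```shell
ES_PID="$(pgrep -f elasticsearch | head -n1)"    # placeholder way to find the ES JVM pid
top -H -p "$ES_PID"                              # note the id of the thread stuck at ~100%
printf '0x%x\n' 12345                            # convert that thread id to hex, e.g. 0x3039
jstack "$ES_PID" | grep -B 2 -A 20 'nid=0x3039'  # the entry with that "nid" is the hot thread
```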

Cheers,
Vadim

On Monday, July 15, 2013 4:46:58 PM UTC+2, Norberto Meijome wrote:

Hi Vadim,

Any chance you can trace the JSON of the query that hangs?

Cheers,
Boaz

On Thu, Jul 18, 2013 at 10:53 AM, Vadim Kisselmann v.kisselmann@gmail.com wrote:

Hi Norberto,
thanks for the info :)

I attached a New Relic screenshot for one node (not the master) which hung
this morning.
It's weird behavior: only one core (we have 8 on this machine) was pegged
at 100% today by a single "search" thread.
In the screenshot you can see this between 9:00 and 9:30 (the straight line
instead of peaks). During this time the whole cluster was not reachable for
our services.

Cheers,
Vadim

On Monday, July 15, 2013 at 4:46:58 PM UTC+2, Norberto Meijome wrote:

FWIW, the nodes you use as masters/balancers need little memory and CPU
compared to the data instances. Almost no IO constraints either (in some
cases, we even merge the LB and the ES master nodes with no problem).
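Norberto's split (dedicated master/balancer nodes vs. data nodes) comes down to a few node settings. A sketch of the relevant elasticsearch.yml lines for that era; the setting names are as I recall them for 0.19/0.90, so verify against your version's documentation:

```yaml
# On the dedicated master/balancer nodes: master-eligible, hold no data.
node.master: true
node.data: false

# On the data-only nodes, invert the two flags:
# node.master: false
# node.data: true

# With three master-eligible nodes, require a majority for election so one
# overloaded or partitioned node cannot leave the cluster headless.
discovery.zen.minimum_master_nodes: 2
```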

On Tue, Jul 16, 2013 at 12:39 AM, Vadim Kisselmann v.kiss...@gmail.com wrote:

Hi guys,
thanks for your replies!
Norberto's solution sounds interesting. It's a good way to keep the
cluster stable.
And yes, it's definitely the case that the master stopped fulfilling its
role. This node was fully blocked.

@Norberto
We had no disk activity at that moment, only the high CPU usage on the
master.

@Boaz

  1. Yes, we have bigger facet queries, about 1000 lines in
    pretty-printed JSON.
  2. We can try to set indices.recovery to debug level, but we have to
    wait for the next crash.
  3. We now have a split hardware setup on AWS: 2x c1.xlarge instances
    (7GB RAM) and one m1.xlarge with 15GB RAM. The latter is currently the
    master. The full index is not really big (30GB).
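Item 2 would be a logging.yml change, roughly like the following (0.19-era log4j-style config; the logger key is assumed from the module name, so verify it before relying on it):

```yaml
logger:
  # Raise recovery logging so the next crash leaves a trail.
  indices.recovery: DEBUG
```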


Hi Boaz,

now 2 of our nodes have crashed, with the same behavior.
I couldn't look at hot_threads, because no requests were possible at that
time.
But I could get a jstack from my elasticsearch process before the restart,
see the attachment.
Many BLOCKED threads, and a couple in IN_NATIVE state in
EPollArrayWrapper.epollWait and receive0.

Tracing our queries could be a pain. They are all fast (under 300ms, the
facet queries too), so the slowlog is not an option. I could log every
query, but we have over 100 per second and I think our disks are not big
enough for that :)
And we have run the same queries for months; we changed nothing. We only
got slightly more data.

Cheers,
Vadim
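The "many BLOCKED threads" impression can be quantified quickly from such a dump. A sketch of tallying the thread states (the excerpt here is synthetic, merely shaped like a HotSpot jstack dump; it is not the real attachment):

```python
import re
from collections import Counter

# Synthetic excerpt shaped like a jstack dump; names and counts are made up.
dump = """\
"elasticsearch[search]-pool-1" prio=10 tid=0x1 nid=0x2 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"elasticsearch[search]-pool-2" prio=10 tid=0x3 nid=0x4 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"New I/O server boss #1" prio=10 tid=0x5 nid=0x6 runnable
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
"""

# Tally the thread states; in the real dump, BLOCKED would dominate.
states = Counter(re.findall(r"java\.lang\.Thread\.State: (\w+)", dump))
print(states.most_common())
```

A dump where BLOCKED dwarfs RUNNABLE usually means the search threads are queued on a shared monitor, which would match a single pegged core doing all the work.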


Hi Vadim,

Strange. In this dump I don't see any threads parsing JSON, like in the
earlier case.

Is there any chance you can upgrade your cluster to 0.90.2? I hate to fall
back to such a standard solution, but it may help. It will also make life
easier when tracing down this kind of thing.

Cheers,
Boaz


Hi Boaz,
it's on our roadmap :) We had hoped to upgrade by the end of 2013, because
it's a big change for our infrastructure, but now it's urgent.
Is it better to upgrade to the stable 0.20.6, or is 0.90.2 stable enough?
Cheers,
Vadim


0.90.2 is stable enough. It will also greatly reduce the memory footprint.
It is very much recommended over 0.20.6. Also make sure you upgrade to the
latest Java version.

Boaz


Thanks for your help, Boaz :)
I will report back on whether the problems are gone after the upgrade. I
hope we can manage it in the next few weeks.
Cheers,
Vadim

Am Freitag, 19. Juli 2013 09:50:39 UTC+2 schrieb Boaz Leskes:

0.90.2 is stable enough. It will also greatly reduce the memory signature.
It is very much recommended to use that instead of 0.20.6. Also make sure
you upgrade to the latest java version

Boaz

On Fri, Jul 19, 2013 at 9:33 AM, Vadim Kisselmann <v.kiss...@gmail.com<javascript:>

wrote:

Hi Boaz,
it's on our roadmap :slight_smile: We hoped to upgrade till the end of 2013, because
it's a big change for our infrastructure, but now it's urgent.
Is it better to upgrade to an stable 0.20.6, or is 0.90.2 stable enough?
Cheers,
Vadim

Am Donnerstag, 18. Juli 2013 20:56:40 UTC+2 schrieb Boaz Leskes:

Hi Vadim,

Strange. In this dump I don't see any threads parsing JSON, like in the
earlier case.

Is there any chance you can upgrade your cluster to 0.90.2? I hate do
fall back to such a standard solution, but it may help. Also, it will make
life easier when tracing down this kind of things.

Cheers,
Boaz

On Thu, Jul 18, 2013 at 5:06 PM, Vadim Kisselmann v.kiss...@gmail.comwrote:

Hi Boaz,

now 2 of our nodes are crashed, with same behavior.
I couldn't see the hot_threads, because no requests are possible at
that time.
But i could get an jstack from my elasticsearch process before restart,
see attachment.
Many BLOCKED threads, and a couple IN_NATIVE state with
EPollArrayWrapper.epollWait and receive0.

To trace our queries could be a pain. They all are fast(under 300ms.
The facet queries, too), so slowlog is not an option. I can
log every query, but we have over 100 per second, i think our disks are
not big enough for this:)
And we have the same queries for months, we changed nothing. We got
only slightly more data.

Cheers,
Vadim

Am Donnerstag, 18. Juli 2013 11:44:56 UTC+2 schrieb Boaz Leskes:

Hi Vadim,

Any chance you can trace the json of the query that hangs?

Cheers,
Boaz

On Thu, Jul 18, 2013 at 10:53 AM, Vadim Kisselmann <
v.kiss...@gmail.com> wrote:

Hi Norberto,
thanks for the info :slight_smile:

I attached a New Relic screenshot for one node(not master) which
hangs today in the morning.
It's a weird behavior. Only one core(we have 8 on this machine) hangs
today with one "search" thread on 100%.
In screenshot you can see this betweeen 9:00-9:30 (this straight line
instead of peaks). In this time the whole cluster was not recheable for our
services.

Cheers,
Vadim

Am Montag, 15. Juli 2013 16:46:58 UTC+2 schrieb Norberto Meijome:

fwiw, the nodes u use as master/balancers use little memory and cpu
compared to the data instances. Almost no IO constraints either ( in some
cases, we even merge the LB and the ES master nodes with no problem).

On Tue, Jul 16, 2013 at 12:39 AM, Vadim Kisselmann <
v.kiss...@gmail.com> wrote:

Hi Guys,
thanks for your reply!
Norberto's solutions sound interesting. It's a good way to keep the
cluster stable.
And yes, it's definitely the case, that the master is stopping
fullfilling his work. This node was fully blocked.

@Norberto
We had no disk activity at these moment, only the high cpu usage on
master.

@Boaz

  1. Yes, we have bigger facet queries. In pretty formatted JSON
    about 1000 lines.
  2. We can try to set indices.recovery to debug level, but we have
    to wait for the next crash
  3. We have an splitted hardware setup now on AWS, 2x c1.xlarge
    instances (7GB RAM) and one with 15GB RAM(m1.xlarge). The last one is the
    master yet. The full index is not really big (30GB).

We restarted the both c1.xlarge instances today in the morning,
because both are running out of memory during the try to rebalance the
cluster(this causes that the whole cluster is not processing queries
anymore(!), although the master had no problems with RAM and especially on
this time no cpu problems):

[2013-07-15 09:45:23,538][INFO ][monitor.jvm] [esearch.cloud]
[gc][ConcurrentMarkSweep][640][163] duration [5.3s],
collections [1]/[5.5s], total [5.3s]/[8.5m], memory
[3.8gb]->[3.8gb]/[3.9gb], all_pools {[Code Cache]
[6.4mb]->[6.4mb]/[48mb]}{[Par Eden Space] [532.5mb]->[532.5mb]/[532.5mb]
}{[Par Survivor Space] [9.5mb]->[6.2mb]/[66.5mb]}{[CMS
Old Gen] [3.3gb]->[3.3gb]/[3.3gb]}{[CMS Perm Gen]
[35.3mb]->[35.3mb]/[82mb]}
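For what it's worth, that GC line already tells the story: a 5.3s CMS collection that reclaims nothing, with the heap at its ceiling. A small sketch pulling the numbers out of that log format (the regex is my own, matched against the line above, not an official parser):

```python
import re

# The "memory" part of the monitor.jvm GC line quoted above.
line = "memory [3.8gb]->[3.8gb]/[3.9gb]"

m = re.search(r"memory \[([\d.]+)gb\]->\[([\d.]+)gb\]/\[([\d.]+)gb\]", line)
before, after, limit = (float(g) for g in m.groups())

print(after == before)          # True: the 5.3s collection freed nothing
print(round(after / limit, 2))  # 0.97: heap is effectively full
```

When the old gen stays at its max across consecutive collections like this, the node spends its time in GC instead of answering pings and queries, which looks exactly like an unresponsive-but-alive node.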

See attached stacktrace from one c1.xlarge instance from this time.

The problem in my first mail (100% cpu) seems to be a different
problem, but it's the trigger for our slave nodes to go crazy, so in
the end the whole cluster is unresponsive to requests.

Vadim

Am Montag, 15. Juli 2013 13:36:12 UTC+2 schrieb Norberto Meijome:

On 15/07/2013 9:14 PM, "Boaz Leskes" b.le...@gmail.com wrote:

Hi Vadim,

Norberto's suggestion will help keep the cluster stable, even
in the case when one of the nodes becomes overloaded. What seems to be
happening is that your master node is under enough pressure for it to
stop fulfilling its role correctly.

That definitely seems to be the case.

B

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
Norberto 'Beto' Meijome

--
You received this message because you are subscribed to a topic in
the Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/89DaUKYrw4s/unsubscribe.
To unsubscribe from this group and all its topics, send an email to
elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Hi Vadim,

Please do post your findings. I'd be very interested. We're having similar issues with cluster crashes, though we have yet to find the root cause. Our setup is similar to Norberto's suggestion. A while back, client apps unicast to all nodes. Due to cluster stability issues, we changed this so client apps only talk to the masters. It really helped, and the cluster was stable for about a month. Then it crashed recently. After a complete restart, the cluster can't seem to stay up for more than 1-2 hrs. There is no indexing or search activity at this time. Yet we're seeing nodes go in and out of the cluster, including masters, which just drives the elected master crazy, to the point where making a cluster health REST request to the master just hangs for a long time.

Thanks,
-Vinh
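For what it's worth, a monitoring script that polls `_cluster/health` with a hard timeout at least fails fast instead of hanging along with the master. A minimal sketch (the response document below is made up for illustration; the field names are the standard `_cluster/health` ones):

```python
import json

# Made-up example of a _cluster/health response from a struggling cluster.
sample = ('{"cluster_name": "es", "status": "red", "timed_out": false,'
          ' "number_of_nodes": 2, "unassigned_shards": 5}')

health = json.loads(sample)
print(health["status"])             # red
print(health["unassigned_shards"])  # 5
```

In practice you would fetch this with e.g. urllib's `urlopen(url, timeout=5)`, so a wedged master raises an error instead of blocking the health check forever.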

On Jul 15, 2013, at 2:35 AM, Norberto Meijome numard@gmail.com wrote:

Hi Vadim,
Do u have any disk activity?
We had similar cases in AWS where nodes would peg CPU and usually get bound to io too.
I changed the cluster so that we have 3 master nodes, with no data stored on them, on smaller instances. Then your data nodes, all of them obviously configured with master=false. The app servers speak to the masters only, via load balancers. This smoothed out the crazy spikes, all nodes are loaded pretty evenly, and we haven't seen a case where a node gets locked or, worse, a split brain.
As always, YMMV .
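Sketched as elasticsearch.yml fragments (0.x-era setting names; treat this as an outline of the topology Norberto describes, not a drop-in config):

```
# On the three dedicated masters (small instances, no data):
node.master: true
node.data: false

# On the data nodes:
node.master: false
node.data: true

# Everywhere: with 3 master-eligible nodes, a quorum of 2
# helps avoid a split brain when one node drops out.
discovery.zen.minimum_master_nodes: 2
```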

On 15/07/2013 7:09 PM, "Vadim Kisselmann" v.kisselmann@gmail.com wrote:
Hi Boaz,

we had a bigger crash last weekend, and now we have problems rebalancing our cluster. Suspiciously, again at 3am.
The master is fully loaded at 100% cpu; this seems to block disk and network on AWS, because the other nodes don't replicate anything. You can
see with atop that disk reads/writes on the master are at 0, and MBr/s is between 0-1MB.
Hot threads on the master are busy with over 100% CPU load. It's weird.
https://gist.github.com/vkisselmann/5998537

Cheers,
Vadim

Am Donnerstag, 11. Juli 2013 21:37:57 UTC+2 schrieb Boaz Leskes:
Hi Vadim,

I don't know of any bug that causes such symptoms, but you never know. It may also be other stuff like scripts etc. Next time it happens (if it does; I understand it's rare), calling the hot threads API would really help in diagnosing it ( http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-hot-threads/ )

Cheers,
Boaz
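If it helps, the hot threads call is just an HTTP GET; a sketch of building it (host, port and parameter values here are placeholders; the endpoint is the one from the URL Boaz linked):

```python
def hot_threads_url(host="localhost", port=9200, threads=3, interval="500ms"):
    # More threads / a longer sampling interval give a fuller picture of
    # what a pegged node is actually doing.
    return ("http://%s:%d/_nodes/hot_threads?threads=%d&interval=%s"
            % (host, port, threads, interval))

print(hot_threads_url())
# http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms
```

Fetch it with curl or urllib against the stuck node while it is stuck; the output (like Vadim's gist) names the exact Java stacks burning the CPU.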

On Wed, Jul 10, 2013 at 11:50 AM, Vadim Kisselmann v.kiss...@gmail.com wrote:
Hi Boaz,

thanks for your reply.

  1. It's the default setting, 3 nodes: 5 shards x 1 replica per index.
  2. It was the master (high cpu load, and only the cpu; ram, hdd i/o, network, everything else was fine).
    After investigating everything like tomcat logs from my services (connections, errors), settings and so on, I found nothing
    suspicious. Everything is like in the past months.
    I have only one idea: ES has a bug in this old version (19.11) and something caused an endless loop, because only the cpu load was at 100% on all 8 cores, but nothing else on this machine.

Cheers,
Vadim

Am Dienstag, 9. Juli 2013 15:35:54 UTC+2 schrieb Boaz Leskes:
Hi Vadim,

Can you say a bit more about your cluster setup?

  1. How many primary shards did you have per index? How many replicas?
  2. Was the node that experienced high cpu load also the cluster master at the time? (You can see in the logs which node was elected master.)

Cheers,
Boaz

On Tuesday, July 9, 2013 9:44:07 AM UTC+2, Vadim Kisselmann wrote:
Hi folks,

our cluster "crashed" this night.
We have a couple of symptoms and are trying to narrow down the problem.
Our setup: 3 nodes in AWS, ES version 19.11, 4 indices for different services.
Master was node1. The cpu load of this node suddenly rose to 100% from 3:00 to 3:30.
The other nodes' cpu load was small. Logs are empty. It was only the cpu load; memory consumption, network, etc. everything was normal.
Services which wanted to connect to their indices timed out after one minute with no response.

What happened here? Could a "slow" query from only one service be a trigger for this? And what about the other nodes in the cluster, why did they not provide
any results for other services from indices which were still working (on nodes 2 & 3)?

A full cluster restart was the only solution for this.
But how can we prevent this case (one node down, the whole cluster does not answer)?

Cheers,
Vadim
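One concrete guard against the "one node down, whole cluster unresponsive" case, for what it's worth: set `discovery.zen.minimum_master_nodes` (the 0.x-era setting name) to a quorum of master-eligible nodes, so a partitioned or wedged minority stops serving instead of taking the cluster with it. The quorum arithmetic is just:

```python
def quorum(master_eligible_nodes):
    # Majority of master-eligible nodes: the usual value for
    # discovery.zen.minimum_master_nodes.
    return master_eligible_nodes // 2 + 1

print(quorum(3))  # 2: a 3-node cluster survives one node failing
```

This is a sketch of the standard recommendation, not a fix for the CPU spin itself, but it keeps a sick node from being followed as master.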


Vinh, what version r u running? ES / JVM / OS ? Ec2?
On 20/07/2013 10:48 AM, "vinh" vinh@loggly.com wrote:


ES 0.90.1 (also similar issues on another cluster using 0.90.2)
Java 1.7.0_11 (another cluster is using 1.7.0_25)
Ubuntu 12.04
A variety of EC2 instances (mostly m2.2xlarge)

I suspect we're encountering a variety of issues. One might be due to bad shards, because some indexes initially have replicas=0, and anything can happen over time. My expectation, though, is that bad shards are a common scenario which the cluster should handle gracefully. Still no definite findings yet.

On Jul 19, 2013, at 9:18 PM, Norberto Meijome numard@gmail.com wrote:

Vinh, what version r u running? ES / JVM / OS ? Ec2?
