Unstable cluster - "suspect illegal state: trying to move shard from primary mode to replica mode"

I have a 10-machine cluster (named es101 to es110) with 32GB RAM per machine.
I've allocated 12GB per machine to Elasticsearch. Memory usage on the
machines looks OK, and CPU and iowait are not dramatic either; nonetheless, the
cluster frequently becomes unstable and loses nodes.
In the logs I am seeing entries like this:

[2014-05-24 10:48:03,336][INFO ][cluster.service ] [es104] master {new [es103][Cyvy8BPvRnyUR0EQvCmjMg][es103.muc.domeus.com][inet[/172.16.9.225:9300]]{master=true}, previous [es102][cDXf0IgzRW2tsMPW4KlbTA][es102.muc.domeus.com][inet[/172.16.9.224:9300]]{master=true}}, removed {[es102][cDXf0IgzRW2tsMPW4KlbTA][es102.muc.domeus.com][inet[/172.16.9.224:9300]]{master=true},}, added {[es106][sQklfgSLS_upLZMz2j9O0w][es106.muc.domeus.com][inet[/172.16.9.228:9300]]{master=true},[es108][lgjzCUNUS9CNUOJIWlqlcg][es108.muc.domeus.com][inet[/172.16.9.230:9300]]{master=true},}, reason: zen-disco-receive(from master [[es103][Cyvy8BPvRnyUR0EQvCmjMg][es103.muc.domeus.com][inet[/172.16.9.225:9300]]{master=true}])
[2014-05-24 10:48:03,423][WARN ][index.shard.service ] [es104] [logstash-2014.01.01][8] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:03,423][WARN ][index.shard.service ] [es104] [logstash-2014.05.15][7] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:03,423][WARN ][index.shard.service ] [es104] [logstash-2014.01.13][8] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:03,423][WARN ][index.shard.service ] [es104] [logstash-2014.01.13][9] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:03,424][WARN ][index.shard.service ] [es104] [logstash-2014.05.13][8] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:03,424][WARN ][index.shard.service ] [es104] [logstash-2014.01.17][7] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:03,424][WARN ][index.shard.service ] [es104] [logstash-2014.01.15][9] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:03,424][WARN ][index.shard.service ] [es104] [logstash-2014.03.24][7] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:03,440][WARN ][index.shard.service ] [es104] [logstash-2014.05.19][9] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:03,458][WARN ][index.shard.service ] [es104] [logstash-2014.03.13][7] suspect illegal state: trying to move shard from primary mode to replica mode
[2014-05-24 10:48:42,801][INFO ][cluster.service ] [es104] master {new [es102][cDXf0IgzRW2tsMPW4KlbTA][es102.muc.domeus.com][inet[/172.16.9.224:9300]]{master=true}, previous [es103][Cyvy8BPvRnyUR0EQvCmjMg][es103.muc.domeus.com][inet[/172.16.9.225:9300]]{master=true}}, removed {[es106][sQklfgSLS_upLZMz2j9O0w][es106.muc.domeus.com][inet[/172.16.9.228:9300]]{master=true},[es108][lgjzCUNUS9CNUOJIWlqlcg][es108.muc.domeus.com][inet[/172.16.9.230:9300]]{master=true},}, added {[es102][cDXf0IgzRW2tsMPW4KlbTA][es102.muc.domeus.com][inet[/172.16.9.224:9300]]{master=true},}, reason: zen-disco-receive(from master [[es102][cDXf0IgzRW2tsMPW4KlbTA][es102.muc.domeus.com][inet[/172.16.9.224:9300]]{master=true}])
[2014-05-24 10:48:42,841][WARN ][index.shard.service ] [es104] [logstash-2014.05.21][1] suspect illegal state: trying to move shard from primary mode to replica mode
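
The excerpt shows the master flipping from es102 to es103 and back within about 40 seconds. To see how often that happens across a whole log file, the cluster.service master-change entries can be pulled out with a small script. This is only a sketch that assumes the log format shown in this post; the name master_changes and the regular expression are illustrative and not part of Elasticsearch. The sample is a truncated copy of the two entries from this post.

```python
import re

# One "master {new [X]..., previous [Y]..." entry as logged by cluster.service.
# The pattern is an assumption based on the log lines quoted in this thread.
MASTER_CHANGE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]"
    r"\[INFO ?\]\[cluster\.service ?\] \[(?P<node>[^\]]+)\] "
    r"master \{new \[(?P<new>[^\]]+)\].*?previous \[(?P<prev>[^\]]+)\]"
)


def master_changes(log_text):
    """Return (timestamp, reporting_node, previous_master, new_master) tuples."""
    # Collapse all whitespace so entries wrapped over several lines still match.
    flat = " ".join(log_text.split())
    return [
        (m.group("ts"), m.group("node"), m.group("prev"), m.group("new"))
        for m in MASTER_CHANGE.finditer(flat)
    ]


# Truncated copy of the two master-change entries from this post.
sample = """
[2014-05-24 10:48:03,336][INFO ][cluster.service ] [es104] master
{new
[es103][Cyvy8BPvRnyUR0EQvCmjMg][es103.muc.domeus.com][inet[/172.16.9.225:9300]]{master=true},
previous
[es102][cDXf0IgzRW2tsMPW4KlbTA][es102.muc.domeus.com][inet[/172.16.9.224:9300]]{master=true}},
[2014-05-24 10:48:42,801][INFO ][cluster.service ] [es104] master
{new
[es102][cDXf0IgzRW2tsMPW4KlbTA][es102.muc.domeus.com][inet[/172.16.9.224:9300]]{master=true},
previous
[es103][Cyvy8BPvRnyUR0EQvCmjMg][es103.muc.domeus.com][inet[/172.16.9.225:9300]]{master=true}},
"""

for ts, node, prev, new in master_changes(sample):
    print(f"{ts} {node}: master {prev} -> {new}")
```

Many such entries close together on the same node would suggest repeated master elections rather than a one-off failover.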

This is what the process looks like on one of the machines which has left
the cluster:
106 21437 114 51.3 473672116 16938980 ? SLl May21 4339:31
/usr/local/java/bin/java -Xms12g -Xmx12g -Xss256k -Djava.awt.headless=true
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError -XX:CMSInitiatingOccupancyFraction=85
-Xmn1024m -Delasticsearch -Des.pidfile=/var/run/elasticsearch.pid
-Des.foreground=yes -Des.path.home=/usr/share/elasticsearch -cp
:/usr/share/elasticsearch/lib/elasticsearch-1.1.1.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/
-Des.default.config=/etc/elasticsearch/elasticsearch.yml
-Des.default.path.home=/usr/share/elasticsearch
-Des.default.path.logs=/DATA1/elasticsearch/log
-Des.default.path.work=/tmp/elasticsearch
-Des.default.path.conf=/etc/elasticsearch -Des.node.name=es104
org.elasticsearch.bootstrap.Elasticsearch
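
One detail in that command line: -XX:CMSInitiatingOccupancyFraction is given twice, first as 75 and then as 85. As far as I know HotSpot uses the last occurrence of a repeated option, but that is worth checking on the JVM in use. Repeated options like this can be spotted with a short script. The following is a sketch only; repeated_xx_flags is a made-up helper name, and the command line is shortened to the JVM options from the ps output above.

```python
import shlex
from collections import defaultdict


def repeated_xx_flags(cmdline):
    """Return {name: [values]} for -XX:Name=value options given more than once."""
    seen = defaultdict(list)
    for arg in shlex.split(cmdline):
        # Only -XX:Name=value options carry a value that can conflict.
        if arg.startswith("-XX:") and "=" in arg:
            name, value = arg[len("-XX:"):].split("=", 1)
            seen[name].append(value)
    return {name: values for name, values in seen.items() if len(values) > 1}


# JVM options copied from the ps output in this post (classpath and -D options omitted).
cmdline = (
    "/usr/local/java/bin/java -Xms12g -Xmx12g -Xss256k -Djava.awt.headless=true "
    "-XX:+UseParNewGC -XX:+UseConcMarkSweepGC "
    "-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly "
    "-XX:+HeapDumpOnOutOfMemoryError -XX:CMSInitiatingOccupancyFraction=85 "
    "-Xmn1024m"
)

print(repeated_xx_flags(cmdline))
# prints {'CMSInitiatingOccupancyFraction': ['75', '85']}
```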

Any ideas what might be going on here and, better still, how to remedy it?
I'm running Elasticsearch 1.1.1 on Debian 7.

Cheers,
-Robin-

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a7d3e79e-f95b-4bd3-a38c-0720111d84b7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

I also found this error on one of the nodes which left the cluster:

java.lang.NullPointerException
    at org.elasticsearch.gateway.local.state.meta.LocalGatewayMetaState.clusterChanged(LocalGatewayMetaState.java:185)
    at org.elasticsearch.gateway.local.LocalGateway.clusterChanged(LocalGateway.java:207)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:431)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:134)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

-Robin-

I think I am running into this same issue, even after upgrading to 1.2.2.

Did you stabilize your cluster?

Thanks,
Mohamed.

On Saturday, May 24, 2014 5:05:55 AM UTC-4, Robin Clarke wrote:

And found this error too in one of the nodes which left the cluster:

java.lang.NullPointerException
    at org.elasticsearch.gateway.local.state.meta.LocalGatewayMetaState.clusterChanged(LocalGatewayMetaState.java:185)
    at org.elasticsearch.gateway.local.LocalGateway.clusterChanged(LocalGateway.java:207)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:431)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:134)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

-Robin-

I adjusted the required master nodes to N-1, where N was the total number
of master nodes I have.
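
If "required master nodes" here means the discovery.zen.minimum_master_nodes setting (an assumption; the post does not name the setting), the usual guidance for Elasticsearch 1.x is a majority of the master-eligible nodes, i.e. N/2 + 1 rounded down. N-1 also prevents two separate groups from each electing a master, but it is stricter: losing two master-eligible nodes leaves the cluster unable to elect one. A rough sketch of the arithmetic, with illustrative function names:

```python
def quorum(master_eligible):
    """Smallest majority of the master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


def can_split_brain(master_eligible, minimum_master_nodes):
    """True if two disjoint groups of nodes could each reach minimum_master_nodes."""
    return 2 * minimum_master_nodes <= master_eligible


def tolerated_losses(master_eligible, minimum_master_nodes):
    """How many master-eligible nodes can be lost while an election is still possible."""
    return master_eligible - minimum_master_nodes


# Compare the majority rule with the N-1 value used in this thread.
for n in (3, 5, 10):
    for label, value in (("majority", quorum(n)), ("N-1", n - 1)):
        print(
            f"N={n} {label}={value} "
            f"split-brain possible={can_split_brain(n, value)} "
            f"tolerated losses={tolerated_losses(n, value)}"
        )
```

For a 10-node cluster where every node is master-eligible, the majority value is 6 and N-1 is 9; both rule out a split brain, but the second tolerates the loss of only one node.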

On 23 July 2014 15:36, Mohamed Lrhazi ml623@georgetown.edu wrote:

I think I am running into this same issue, even after upgrading to 1.2.2.

Did you stabilize your cluster?

Thanks,
Mohamed.

--
Best winds,
-Robin-
~:)

Thanks, Robin. For me, the change that seems to have worked (fingers
crossed) was to add node.master=False to all nodes except one...
black magic!
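
With a single master-eligible node there is nothing left to contend in an election, which would explain why the flapping stops; the trade-off is that this one node becomes a single point of failure for the cluster state. A sketch of what the per-node settings could look like, written as elasticsearch.yml lines. The helper node_settings is made up for illustration, the node names are borrowed from Robin's cluster, and picking es101 as the master is arbitrary.

```python
def node_settings(nodes, master_eligible):
    """Map each node name to elasticsearch.yml lines for its master role."""
    eligible = set(master_eligible)
    unknown = eligible - set(nodes)
    if unknown:
        raise ValueError(f"not in node list: {sorted(unknown)}")
    return {
        name: [
            f"node.name: {name}",
            # Only the chosen nodes may be elected master.
            f"node.master: {'true' if name in eligible else 'false'}",
        ]
        for name in nodes
    }


# Node names es101..es110 as in the original post; es101 as master is an arbitrary choice.
nodes = [f"es{i}" for i in range(101, 111)]
settings = node_settings(nodes, ["es101"])

for name in ("es101", "es102"):
    print("\n".join(settings[name]))
    print()
```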

Thanks,
Mohamed.

On Sun, Jul 27, 2014 at 7:35 AM, Robin Clarke robin@robinclarke.net wrote:

I adjusted the required master nodes to N-1, where N was the total number
of master nodes I have.
