Elasticsearch cannot join cluster

I found a problem with my Elasticsearch cluster: machine 1 is hitting timeouts.
The Elasticsearch logs are below:

[2012-07-01 00:10:11,876][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [63911ms] ago, timed out [33911ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652733]

[2012-07-01 00:10:11,877][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [33912ms] ago, timed out [0ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652734]

[2012-07-01 00:12:08,150][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [50502ms] ago, timed out [20502ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652782]

[2012-07-01 00:19:59,034][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [61803ms] ago, timed out [31803ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652917]

[2012-07-01 00:19:59,034][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [31803ms] ago, timed out [1803ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652918]

[2012-07-01 00:28:03,589][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [38953ms] ago, timed out [8953ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [653028]

[2012-07-01 00:34:13,351][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [68650ms] ago, timed out [38650ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [653034]

[2012-07-01 00:34:43,747][INFO ][discovery.zen ] [Masaryk, Milos] master_left [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]

[2012-07-01 00:36:34,455][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [38650ms] ago, timed out [8650ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [653035]

[2012-07-01 00:37:43,467][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [218766ms] ago, timed out [188766ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [653036]

[2012-07-01 00:37:43,467][INFO ][cluster.service ] [Masaryk, Milos] master {new [Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]], previous [Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]}, removed {[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]],}, reason: zen-disco-master_failed ([Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]])

[2012-07-01 00:39:44,042][INFO ][discovery.zen ] [Masaryk, Milos] master_left [[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]]], reason [no longer master]

[2012-07-01 00:39:44,043][INFO ][cluster.service ] [Masaryk, Milos] master {new [Masaryk, Milos][c6PfkH6uQ4-ACL4nacXAug][inet[/172.20.118.210:9300]], previous [Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]]}, removed {[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]],}, reason: zen-disco-master_failed ([Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]])

[2012-07-01 00:40:50,896][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [65851ms] ago, timed out [11206ms] ago, action [discovery/zen/fd/ping], node [[Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][inet[/172.20.118.215:9300]]], id [653041]

[2012-07-01 00:43:33,677][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [173984ms] ago, timed out [139279ms] ago, action [/cluster/nodes/indices/shard/store/n], node [[Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][inet[/172.20.118.215:9300]]], id [653044]

[2012-07-01 00:43:33,677][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [207809ms] ago, timed out [173985ms] ago, action [/gateway/local/started-shards/n], node [[Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][inet[/172.20.118.215:9300]]], id [653042]

[2012-07-01 00:44:37,569][WARN ][index.shard.service ] [Masaryk, Milos] [directory][3] suspect illegal state: trying to move shard from primary mode to backup mode

[2012-07-01 00:44:46,508][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 00:45:40,538][INFO ][cluster.service ] [Masaryk, Milos] removed {[Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][inet[/172.20.118.215:9300]],}, reason: zen-disco-node_failed([Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][inet[/172.20.118.215:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout

[2012-07-01 00:45:48,543][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 00:58:32,684][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 00:58:45,582][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:00:01,147][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:00:01,338][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:00:01,360][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:03:54,205][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]]]

[2012-07-01 01:04:45,344][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]]]

[2012-07-01 01:04:45,345][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]]]

[2012-07-01 01:05:07,703][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][653683][1422] duration [16.7s], collections [1]/[17.5s], total [16.7s]/[25.5s], memory [102mb]->[87.7mb]/[1019.8mb]

[2012-07-01 01:09:09,634][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:18:30,490][WARN ][index.engine.robin ] [Masaryk, Milos] [directory][1] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [directory][1] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:779)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:218)
at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:178)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:279)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

[2012-07-01 01:18:30,491][WARN ][index.engine.robin ] [Masaryk, Milos] [directory][4] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [directory][4] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:779)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:218)
at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:178)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:279)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

[2012-07-01 01:38:02,862][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][654868][1423] duration [3.4s], collections [1]/[1s], total [3.4s]/[28.9s], memory [104mb]->[104mb]/[1019.8mb]

[2012-07-01 02:37:16,178][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][656829][1425] duration [32.6s], collections [1]/[1.3m], total [32.6s]/[1m], memory [107.1mb]->[91.5mb]/[1019.8mb]

[2012-07-01 03:55:11,063][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][659670][1429] duration [12.8s], collections [1]/[37.9s], total [12.8s]/[1.2m], memory [110.4mb]->[96mb]/[1019.8mb]

[2012-07-01 04:55:08,395][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][661918][1432] duration [17.8s], collections [1]/[17.9s], total [17.8s]/[1.5m], memory [112.5mb]->[96.6mb]/[1019.8mb]

[2012-07-01 05:11:00,493][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][662567][1433] duration [1.5m], collections [1]/[1.5m], total [1.5m]/[3m], memory [113.1mb]->[99.5mb]/[1019.8mb]

[2012-07-01 06:06:28,152][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][664137][1436] duration [54.3s], collections [1]/[54.3s], total [54.3s]/[3.9m], memory [119.1mb]->[105.5mb]/[1019.8mb]

I use Elasticsearch 0.19.1 on 4 machines. The weird thing I found in elasticsearch-head is attached to this topic.
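For reference, the "failed to ping, tried [3] times, each with maximum [30s] timeout" messages above come from zen discovery fault detection, which is controlled in elasticsearch.yml. A minimal sketch with the 0.19.x defaults spelled out; the minimum_master_nodes line is an illustrative assumption for a 4-node cluster, not something taken from this cluster's actual config:

# elasticsearch.yml (sketch) -- zen discovery fault detection.
# These are the 0.19.x defaults that produce the
# "tried [3] times, each with maximum [30s] timeout" log lines above.
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3

# Assumed hardening for a 4-node cluster (not from this cluster's config):
# require a quorum of master-eligible nodes before electing a master, so a
# partitioned minority cannot elect its own master and trigger the
# "master should not receive new cluster state" warnings seen above.
discovery.zen.minimum_master_nodes: 3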

Sicker wrote:

I found a problem with my Elasticsearch cluster: machine 1 is hitting timeouts.
The Elasticsearch logs are below:

[...]

[2012-07-01 00:34:43,747][INFO ][discovery.zen ] [Masaryk, Milos] master_left [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]

[...]

[2012-07-01 00:37:43,467][INFO ][cluster.service ] [Masaryk, Milos] master {new [Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]], previous [Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]}, removed {[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]],}, reason: zen-disco-master_failed ([Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]])

[2012-07-01 00:39:44,042][INFO ][discovery.zen ] [Masaryk, Milos] master_left [[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]]], reason [no longer master]

[2012-07-01 00:39:44,043][INFO ][cluster.service ] [Masaryk, Milos] master {new [Masaryk, Milos][c6PfkH6uQ4-ACL4nacXAug][inet[/172.20.118.210:9300]], previous [Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]]}, removed {[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]],}, reason: zen-disco-master_failed ([Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/172.20.118.209:9300]])

172.20.118.210 is struggling to communicate with the rest of the
nodes. What is the health and load avg on each node?

-Drew
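
A quick way to gather what Drew is asking for is to hit the cluster health API on every node and compare the answers; below is a minimal sketch in Python. The host list and the default HTTP port 9200 are assumptions based on the addresses in the logs, and load average itself is easiest to read with uptime on each box.

# check_health.py -- ask each node for its own view of cluster health.
# Host list assumed from the addresses in the logs above; REST port 9200 assumed.
import json
import urllib.request

NODES = [
    "172.20.118.208",  # Sunfire
    "172.20.118.209",  # Vengeance
    "172.20.118.210",  # Masaryk, Milos
    "172.20.118.215",  # Malus, Karl
]

for host in NODES:
    url = "http://%s:9200/_cluster/health" % host
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            health = json.load(resp)
        # Nodes that disagree on number_of_nodes (or report status red)
        # are the ones that have fallen out of the cluster.
        print("%s: status=%s nodes=%d unassigned_shards=%d"
              % (host, health["status"], health["number_of_nodes"],
                 health["unassigned_shards"]))
    except Exception as exc:
        print("%s: unreachable (%s)" % (host, exc))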

All nodes are red and don't see each other; after restarting all the nodes, it's now OK.

I think the network between the nodes was down for too long, and after the network came back the nodes did not try to find the cluster again.
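
Whether a node keeps looking for the cluster after the network comes back depends on the discovery settings. One hedged option is to disable multicast and give every node an explicit list of peers to ping when it needs to rejoin, sketched below; the host list is an assumption taken from the addresses in the logs, not from this cluster's config.

# elasticsearch.yml (sketch) -- explicit unicast discovery, assumed host list.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["172.20.118.208", "172.20.118.209", "172.20.118.210", "172.20.118.215"]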

On Tuesday, July 3, 2012 7:50:52 PM UTC+7, Sicker wrote:

[...]

In 0.19.5 and above we improved the logic around exactly the failure you are seeing; it might make sense to upgrade to the latest version.

On Tue, Jul 3, 2012 at 5:13 PM, Sicker sicker27@gmail.com wrote:

[...]

What is the behaviour of this case in 0.19.5?

On Tuesday, July 3, 2012 11:18:47 PM UTC+7, kimchy wrote:

[...]

Is this one fixed in 0.19.5?

On Wed, Jul 4, 2012 at 1:54 PM, Sicker sicker27@gmail.com wrote:

What is the behaviour of this case in 0.19.5?

On Tuesday, July 3, 2012 11:18:47 PM UTC+7, kimchy wrote:

In 0.19.5 and above we improved the logic of exactly the failure you are
seeing, might make sense to upgrade to the latest version.

On Tue, Jul 3, 2012 at 5:13 PM, Sicker sicker27@gmail.com wrote:

All node are red and don't see each other, after restart all node, it's
now ok.

I think the network between each node down to long after network come
back. the node is not try to find the cluster again.

On Tuesday, July 3, 2012 7:50:52 PM UTC+7, Sicker wrote:

I found the problem with elasticSearch cluster on machine 1 found
timeout. the Elasticsearch log are below:

[2012-07-01 00:10:11,876][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [63911ms] ago, timed out [33911ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652733]

[2012-07-01 00:10:11,877][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [33912ms] ago, timed out [0ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652734]

[2012-07-01 00:12:08,150][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [50502ms] ago, timed out [20502ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652782]

[2012-07-01 00:19:59,034][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [61803ms] ago, timed out [31803ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652917]

[2012-07-01 00:19:59,034][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [31803ms] ago, timed out [1803ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [652918]

[2012-07-01 00:28:03,589][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [38953ms] ago, timed out [8953ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [653028]

[2012-07-01 00:34:13,351][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [68650ms] ago, timed out [38650ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [653034]

[2012-07-01 00:34:43,747][INFO ][discovery.zen ] [Masaryk, Milos] master_left [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]

[2012-07-01 00:36:34,455][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [38650ms] ago, timed out [8650ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [653035]

[2012-07-01 00:37:43,467][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [218766ms] ago, timed out [188766ms] ago, action [discovery/zen/fd/masterPing], node [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]], id [653036]

[2012-07-01 00:37:43,467][INFO ][cluster.service ] [Masaryk, Milos] master {new [Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/**172.20.118.209:**9300]], previous [Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]}, removed {[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]],}, reason: zen-disco-master_failed ([Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]])

[2012-07-01 00:39:44,042][INFO ][discovery.zen ] [Masaryk, Milos] master_left [[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/**172.20.118.209:**9300]]], reason [no longer master]

[2012-07-01 00:39:44,043][INFO ][cluster.service ] [Masaryk, Milos] master {new [Masaryk, Milos][c6PfkH6uQ4-ACL4nacXAug]****[inet[/172.20.118.210:9300]], previous [Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/**172.20.118.209:**9300]]}, removed {[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/**172.20.118.209:**9300]],}, reason: zen-disco-master_failed ([Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/**172.20.118.209:**9300]])

[2012-07-01 00:40:50,896][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [65851ms] ago, timed out [11206ms] ago, action [discovery/zen/fd/ping], node [[Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][****inet[/172.20.118.215:9300]]], id [653041]

[2012-07-01 00:43:33,677][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [173984ms] ago, timed out [139279ms] ago, action [/cluster/nodes/indices/shard/****store/n], node [[Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][****inet[/172.20.118.215:9300]]], id [653044]

[2012-07-01 00:43:33,677][WARN ][transport ] [Masaryk, Milos] Received response for a request that has timed out, sent [207809ms] ago, timed out [173985ms] ago, action [/gateway/local/started-shards/n], node [[Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][****inet[/172.20.118.215:9300]]], id [653042]

[2012-07-01 00:44:37,569][WARN ][index.shard.service ] [Masaryk, Milos] [directory][3] suspect illegal state: trying to move shard from primary mode to backup mode

[2012-07-01 00:44:46,508][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 00:45:40,538][INFO ][cluster.service ] [Masaryk, Milos] removed {[Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][****inet[/172.20.118.215:9300]],}, reason: zen-disco-node_failed([Malus, Karl][mHn3YRK0SkCp88yUp2kqQw][****inet[/172.20.118.215:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout

[2012-07-01 00:45:48,543][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 00:58:32,684][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 00:58:45,582][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:00:01,147][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:00:01,338][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:00:01,360][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:03:54,205][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/**172.20.118.209:**9300]]]

[2012-07-01 01:04:45,344][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/**172.20.118.209:**9300]]]

[2012-07-01 01:04:45,345][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Vengeance][F3kgDi9sTva4DGlw1whhrQ][inet[/**172.20.118.209:**9300]]]

[2012-07-01 01:05:07,703][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][653683][1422] duration [16.7s], collections [1]/[17.5s], total [16.7s]/[25.5s], memory [102mb]->[87.7mb]/[1019.8mb]

[2012-07-01 01:09:09,634][WARN ][discovery.zen ] [Masaryk, Milos] master should not receive new cluster state from [[Sunfire][T7DyaW6PSuWONZK4NNw3Bw][inet[/172.20.118.208:9300]]]

[2012-07-01 01:18:30,490][WARN ][index.engine.robin ] [Masaryk, Milos] [directory][1] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [directory][1] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:779)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:218)
at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:178)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:279)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

[2012-07-01 01:18:30,491][WARN ][index.engine.robin ] [Masaryk, Milos] [directory][4] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [directory][4] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:779)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:218)
at org.elasticsearch.indices.memory.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:178)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:279)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

[2012-07-01 01:38:02,862][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][654868][1423] duration [3.4s], collections [1]/[1s], total [3.4s]/[28.9s], memory [104mb]->[104mb]/[1019.8mb]

[2012-07-01 02:37:16,178][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][656829][1425] duration [32.6s], collections [1]/[1.3m], total [32.6s]/[1m], memory [107.1mb]->[91.5mb]/[1019.8mb]

[2012-07-01 03:55:11,063][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][659670][1429] duration [12.8s], collections [1]/[37.9s], total [12.8s]/[1.2m], memory [110.4mb]->[96mb]/[1019.8mb]

[2012-07-01 04:55:08,395][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][661918][1432] duration [17.8s], collections [1]/[17.9s], total [17.8s]/[1.5m], memory [112.5mb]->[96.6mb]/[1019.8mb]

[2012-07-01 05:11:00,493][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][662567][1433] duration [1.5m], collections [1]/[1.5m], total [1.5m]/[3m], memory [113.1mb]->[99.5mb]/[1019.8mb]

[2012-07-01 06:06:28,152][WARN ][monitor.jvm ] [Masaryk, Milos] [gc][ParNew][664137][1436] duration [54.3s], collections [1]/[54.3s], total [54.3s]/[3.9m], memory [119.1mb]->[105.5mb]/[1019.8mb]

I use elasticsearch 0.19.1 on 4 machines. The weird thing I found in elasticsearch-head is attached to this topic.
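
One thing that stands out in the log above is the GC behaviour: single ParNew collections run from about 16 seconds up to 1.5 minutes against a roughly 1GB heap, which is longer than the 30s master-ping timeout, so a node can be dropped even when the network is fine. A minimal sketch of giving the node more heap, assuming the stock 0.19.x startup scripts (which read ES_HEAP_SIZE and use it for both -Xms and -Xmx); the 4g value is only an illustration and should be sized to the machine's RAM:

# illustrative only: size the heap to the actual memory on the box,
# leaving plenty of room for the OS filesystem cache
export ES_HEAP_SIZE=4g
bin/elasticsearch -f   # -f keeps the 0.19.x process in the foreground while testing

After that it is worth watching whether the monitor.jvm warnings go away before blaming the network alone.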

Hi kimchy,

After I upgraded Elasticsearch to 0.19.8 and tested this scenario again, the cluster still does not join back together.

I think it is waiting for something to trigger the rejoin process.

What do you think about my case?
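
If it helps to narrow down whether the nodes are stuck in separate one-node clusters or are just slow to converge, the cluster APIs show what each node currently believes. A quick check, run against each of the four machines (localhost stands in for whichever host you query):

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'   # number_of_nodes should reach 4
curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'    # master_node shows who this node thinks is master

If two machines report different master_node ids, the cluster has split rather than merely failed to rejoin.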

On Tuesday, July 3, 2012 11:18:47 PM UTC+7, kimchy wrote:

In 0.19.5 and above we improved the logic around exactly the failure you are seeing; it might make sense to upgrade to the latest version.

On Tue, Jul 3, 2012 at 5:13 PM, Sicker <sick...@gmail.com> wrote:

All nodes are red and don't see each other; after restarting all nodes, it's OK now.

I think the network between the nodes was down for too long, and after the network came back the nodes did not try to find the cluster again.
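
The repeated "master should not receive new cluster state" warnings earlier in the log are the usual sign of that: after the partition, more than one node believed it was master (a split brain). A sketch of the zen discovery settings commonly tuned for this, set in elasticsearch.yml on every node (the values are illustrative, not taken from this cluster):

# with 4 master-eligible nodes, require a majority before a master can be elected,
# so an isolated node cannot keep running its own one-node cluster
discovery.zen.minimum_master_nodes: 3

# tolerate slower pings so a long GC pause or a short network blip
# does not immediately drop the master
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6

With a majority requirement in place, the minority side of a partition should step down and rejoin when connectivity returns instead of publishing its own cluster state.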

On Tuesday, July 3, 2012 7:50:52 PM UTC+7, Sicker wrote:
