NodeDisconnectedException when using optimize; "received ping response with no matching id"

Hi Shay,

I experienced some strange behaviour, but "all is good" in the end :slight_smile:
Here's the story:

I had just indexed 17M documents with the default index settings; after that I invoked an optimize on the index:

$ time curl -XPOST 'http://localhost:9200/hbz/_optimize?max_num_segments=1&refresh=true&flush=true'
{
  "ok": true,
  "_shards": {
    "total": 10,
    "successful": 7,
    "failed": 3,
    "failures": [
      {"index": "hbz", "shard": 2, "reason": "BroadcastShardOperationFailedException[[hbz][2]]; nested: NodeDisconnectedException[[Dionysus][inet[/127.0.0.1:9300]][indices/optimize/shard]]; "},
      {"index": "hbz", "shard": 3, "reason": "BroadcastShardOperationFailedException[[hbz][3]]; nested: NodeDisconnectedException[[Dionysus][inet[/127.0.0.1:9300]][indices/optimize/shard]]; "},
      {"index": "hbz", "shard": 0, "reason": "BroadcastShardOperationFailedException[[hbz][0]]; nested: NodeDisconnectedException[[Evilhawk][inet[/127.0.0.1:9301]][indices/optimize/shard]]; "}
    ]
  }
}
real 213m55.301s

So I looked at the elasticsearch logs:

INFO: [Archangel] detected_master
[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]],
added {[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300
]],[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],},
reason: zen-disco-receive(from
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]])
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] elasticsearch/523d85c5-a95a-40f2-81cf-d0e9d7143e59
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] bound_address {inet[/127.0.0.1:9200]}, publish_address
{inet[localhost/127.0.0.1:9200]}
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] bound_address
{service:jmx:rmi:///jndi/rmi://:9402/jmxrmi}, publish_address
{service:jmx:rmi:///jndi/rmi://127.0.0.1:9402/jmxrmi}
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] {elasticsearch/0.0.0}[32085]: started
06.08.2010 13:33:48 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master_left
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
06.08.2010 13:33:48 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master {new
[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],
previous
[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]},
removed
{[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]],},
reason: zen-disco-master_failed
([Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]])
06.08.2010 13:50:19 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master_left
[[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
06.08.2010 13:50:19 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master {new
[Archangel][523d85c5-a95a-40f2-81cf-d0e9d7143e59][inet[localhost/127.0.0.1:9302]]{data=false},
previous
[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]},
removed
{[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],},
reason: zen-disco-master_failed
([Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]])
06.08.2010 13:50:25 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalWarn
WARNING: [Archangel] master should not receive new cluster state from
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]]
06.08.2010 14:29:44 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalWarn
WARNING: [Archangel] master should not receive new cluster state from
[[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]]
06.08.2010 14:31:21 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalWarn
WARNING: [Archangel] master should not receive new cluster state from
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]]

(Curiously, the nodes said goodbye at 13:50 and said hello again ("new
cluster state") 40 minutes later.)
Of course I could not curl anything - the data nodes were not reachable.
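
By "curl anything" I mean plain HTTP calls against the cluster; as an illustration of the kind of request that went nowhere at this point (a sketch, not the exact command I typed):

$ curl -XGET 'http://localhost:9200/hbz/_search?q=*:*&size=0'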

So I killed Archangel (the non-data node, which was now the master) and brought
up a new non-data node:

WARNING: [Storm, Johnny] received ping response with no matching id [1]

It seems Johnny was also not willing to join my data nodes :frowning:
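
For reference, a "non-data node" here means a node started with node.data set to false; roughly like this, sketched from memory (the exact invocation on my side may have differed):

$ bin/elasticsearch -Des.node.data=false

The {data=false} marker on Archangel in the log above is that same setting.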

So I brought one of my two data nodes down, forcing a snapshot, to save
some data for a later rebuild of the index (I had started the data nodes
with es.index.gateway.snapshot_interval=-1).
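
Roughly how the data nodes had been started; again a sketch from memory rather than the exact command line:

$ bin/elasticsearch -Des.index.gateway.snapshot_interval=-1

With the interval at -1 the gateway snapshot never runs on a schedule, which is why taking the node down was my way of forcing one.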

I killed Johnny (i.e. the current master) and brought up a new non-data node:

06.08.2010 15:30:25 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Vulcan] detected_master
[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],
added
{[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],},
reason: zen-disco-receive(from
[[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]])

You see, curiously, now the data node was detected! "status" is fine and the
total hit count is 17,242,864, meaning no document was lost :slight_smile:
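
The checks behind that statement were plain curls, roughly these; a sketch rather than the exact commands I typed:

$ curl -XGET 'http://localhost:9200/hbz/_status'
$ curl -XGET 'http://localhost:9200/hbz/_count?q=*:*'

The second one reports the document count without fetching any hits.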

(I ran into the same problem again later, doing something different: when
recovering that snapshot, starting a non-data node first brought "received
ping response with no matching id"; I waited => nothing, killed it and started
a new one => successful connection. It seems to have to do with the large
amount of data (48GB index) and thus the heavy load on the machine
(recovering that snapshot takes 1 hour).)

A newly "optimize" was done in 0.02ms so it doesnot seem that the first
optimize-run was faulty, and as said, all documents are there.

oo

Hi,

The optimization process was probably pretty heavy, which caused the node
to be detected as failed. Can you post the Dionysus node logs?
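
If that is what happened, you could also try making the zen fault detection more lenient for such heavy operations; from memory (double-check the exact setting names against the version you are running), something like:

$ bin/elasticsearch -Des.discovery.zen.fd.ping_timeout=60s -Des.discovery.zen.fd.ping_retries=6

The "tried [3] times, each with maximum [30s] timeout" in your log corresponds to those two settings.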

-shay.banon


Hi Shay,

I don't have those logs anymore. I did the whole procedure again, with no
trouble this time. The optimizing was nearly 2x faster, so at the time I
had the problems there was presumably some other heavy work being done on the
machine; it may be that other nodes were being started and shut down on the
subnet. Sorry that I lost the logs.

oo


No problem. If it happens next time, it would help, since I want to know
whether the fault detection kicked in because the socket got closed or
because the node was not reachable over the TCP ping.
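
If it does happen again, it might also be worth turning up the discovery logging beforehand so the logs show which of the two it was; roughly (sketched from memory, check the logging config shipped with your version), in config/logging.yml:

logger:
  discovery: DEBUG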

-shay.banon
