NodeDisconnectedException when using optimize; "received ping response with no matching id"


(Ein Droid) #1

Hi Shay,

I experienced some strange behaviour, but "all is good" in the end :slight_smile:
That's the story:

I had just indexed 17M documents with the default index settings; after that I
invoked an optimize on the index:

$ time curl -XPOST 'http://localhost:9200/hbz/_optimize?max_num_segments=1&refresh=true&flush=true'
{"ok":true,"_shards":{"total":10,"successful":7,"failed":3,"failures":[{"index":"hbz","shard":2,"reason":"BroadcastShardOperationFailedException[[hbz][2]
]; nested: NodeDisconnectedException[[Dionysus][inet[/127.0.0.1:9300]][indices/optimize/shard]];
"},{"index":"hbz","shard":3,"reason":"BroadcastShardOperationFailedException[[hbz][3]
]; nested: NodeDisconnectedException[[Dionysus][inet[/127.0.0.1:9300]][indices/optimize/shard]];
"},{"index":"hbz","shard":0,"reason":"BroadcastShardOperationFailedException[[hbz][0]
]; nested: NodeDisconnectedException[[Evilhawk‎][inet[/127.0.0.1:9301]][indices/optimize/shard]];
"}]}}
real 213m55.301s
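
(For reference, a minimal sketch of how one could verify that the nodes had rejoined and then simply re-run the optimize for the shards that failed - the cluster health endpoint and its pretty=true flag are my assumption for this version, the optimize call is the same as above:)

$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
# once the cluster reports a healthy status again, the optimize can just be repeated;
# shards that already merged down to one segment finish almost instantly
$ curl -XPOST 'http://localhost:9200/hbz/_optimize?max_num_segments=1&refresh=true&flush=true'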

So I looked at the elasticsearch logs:

INFO: [Archangel] detected_master
[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]],
added {[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300
]],[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],},
reason: zen-disco-receive(from
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]])
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] elasticsearch/523d85c5-a95a-40f2-81cf-d0e9d7143e59
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] bound_address {inet[/127.0.0.1:9200]}, publish_address
{inet[localhost/127.0.0.1:9200]}
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] bound_address
{service:jmx:rmi:///jndi/rmi://:9402/jmxrmi}, publish_address
{service:jmx:rmi:///jndi/rmi://127.0.0.1:9402/jmxrmi}
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] {elasticsearch/0.0.0}[32085]: started
06.08.2010 13:33:48 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master_left
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
06.08.2010 13:33:48 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master {new
[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],
previous
[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]},
removed
{[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]],},
reason: zen-disco-master_failed
([Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]])
06.08.2010 13:50:19 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master_left
[[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
06.08.2010 13:50:19 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master {new
[Archangel][523d85c5-a95a-40f2-81cf-d0e9d7143e59][inet[localhost/127.0.0.1:9302]]{data=false},
previous
[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]},
removed
{[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],},
reason: zen-disco-master_failed
([Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]])
06.08.2010 13:50:25 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalWarn
WARNUNG: [Archangel] master should not receive new cluster state from
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]]
06.08.2010 14:29:44 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalWarn
WARNUNG: [Archangel] master should not receive new cluster state from
[[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]]
06.08.2010 14:31:21 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalWarn
WARNUNG: [Archangel] master should not receive new cluster state from
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]]

(Curiously, the nodes said goodbye at 13:50 and said hello again ("new
cluster state") 40 minutes later.)
Of course I could not curl anything - the data nodes were not reachable.

So I killed Archangel (the non-data node, which was now the master) and brought
up a new non-data node:

WARNUNG: [Storm, Johnny] received ping response with no matching id [1]

It seems Johnny is also not willing to join my data nodes :frowning:
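
(The non-data nodes are just nodes started with node.data disabled - a minimal sketch, assuming the same -Des. system-property style I used for the gateway setting below:)

$ bin/elasticsearch -Des.node.data=false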

So I brought one of my two data nodes down, forcing a snapshot, to save
some data for a later rebuild of the index (I had started the data nodes
with es.index.gateway.snapshot_interval=-1).
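
(That is, the data nodes had been started roughly like this - the -1 interval disables scheduled gateway snapshots, which is why bringing a node down was what forced the snapshot; the -Des. form is how I passed the setting, the exact launcher syntax may differ per setup:)

$ bin/elasticsearch -Des.index.gateway.snapshot_interval=-1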

I killed Johnny (by then the master) and brought up a new non-data node:

06.08.2010 15:30:25 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Vulcan] detected_master
[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],
added
{[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],},
reason: zen-disco-receive(from
[[Evilhawk‎][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]])

You see, curiously, now the data node was detected! The "status" is fine and the
total hits are 17,242,864, so no document was lost :slight_smile:
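
(A minimal sketch of that check - the status and count endpoints as I believe they exist in this version, with the count defaulting to a match_all query when no body is given:)

$ curl -XGET 'http://localhost:9200/hbz/_status'
$ curl -XGET 'http://localhost:9200/hbz/_count'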

(I ran into the same problem later while doing something different: when
recovering that snapshot, starting a non-data node first brings "received
ping response with no matching id"; waiting did nothing, so I killed it and
started a new one => successful connection. It seems to be related to the
large amount of data (48 GB index) and thus the heavy load on the machine
(recovering that snapshot takes 1 hour).)
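
(Something I might try next time: raising the zen fault-detection limits that show up in the log as "tried [3] times, each with maximum [30s] timeout", so a heavily loaded node is not dropped so quickly. A sketch only - the setting names are my assumption for this version:)

# config/elasticsearch.yml (assumed setting names; defaults appear to be 30s / 3 tries)
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6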

A newly "optimize" was done in 0.02ms so it doesnot seem that the first
optimize-run was faulty, and as said, all documents are there.

oo


(Shay Banon) #2

Hi,

The optimization process was probably pretty heavy, which caused the node
to be detected as failed. Can you post the Dionysus node logs?

-shay.banon


(Ein Droid) #3

Hi Shay,

I don't have those logs anymore. I did the whole procedure again and had no
trouble this time. The optimizing was nearly 2x faster, so at the time I had
the problems there was presumably some other heavy work being done on the
machine - it may be that other nodes were started and shut down on the
subnet. Sorry that I lost the logs.

oo


(Shay Banon) #4

No problem. If it happens next time, it would help, since I want to
know whether the fault detection kicked in because the socket got closed or
because the node was not reachable over TCP ping.
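
(If it does, grepping the data node logs for the two failure signatures that appear above should show which one it was - a plain grep, assuming the default log location:)

$ grep -E 'failed to ping|NodeDisconnectedException' logs/elasticsearch.log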

-shay.banon


(system) #5