Hi Shay,
I experienced some strange behaviour, but "all is good" in the end.
That's the story:
I had just indexed 17 M documents with the default index settings; after
that I invoked an optimize of the index:
$ time curl -XPOST 'http://localhost:9200/hbz/_optimize?max_num_segments=1&refresh=true&flush=true'
{"ok":true,"_shards":{"total":10,"successful":7,"failed":3,"failures":[{"index":"hbz","shard":2,"reason":"BroadcastShardOperationFailedException[[hbz][2]
]; nested: NodeDisconnectedException[[Dionysus][inet[/127.0.0.1:9300]][indices/optimize/shard]];
"},{"index":"hbz","shard":3,"reason":"BroadcastShardOperationFailedException[[hbz][3]
]; nested: NodeDisconnectedException[[Dionysus][inet[/127.0.0.1:9300]][indices/optimize/shard]];
"},{"index":"hbz","shard":0,"reason":"BroadcastShardOperationFailedException[[hbz][0]
]; nested: NodeDisconnectedException[[Evilhawk][inet[/127.0.0.1:9301]][indices/optimize/shard]];
"}]}}
real 213m55.301s
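(For reference, the state of the shards that failed can also be inspected
through the cluster health API; a minimal sketch, assuming the same HTTP
port as above and shard-level detail via the level parameter:

# overall cluster status, plus the state of every shard of every index
$ curl -XGET 'http://localhost:9200/_cluster/health?level=shards&pretty=true'
)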
So I looked at the Elasticsearch logs:
INFO: [Archangel] detected_master
[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]],
added {[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300
]],[Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],},
reason: zen-disco-receive(from
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]])
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] elasticsearch/523d85c5-a95a-40f2-81cf-d0e9d7143e59
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] bound_address {inet[/127.0.0.1:9200]}, publish_address
{inet[localhost/127.0.0.1:9200]}
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] bound_address
{service:jmx:rmi:///jndi/rmi://:9402/jmxrmi}, publish_address
{service:jmx:rmi:///jndi/rmi://127.0.0.1:9402/jmxrmi}
05.08.2010 17:17:04 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] {elasticsearch/0.0.0}[32085]: started
06.08.2010 13:33:48 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master_left
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
06.08.2010 13:33:48 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master {new
[Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],
previous
[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]},
removed
{[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]],},
reason: zen-disco-master_failed
([Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]])
06.08.2010 13:50:19 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master_left
[[Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
06.08.2010 13:50:19 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Archangel] master {new
[Archangel][523d85c5-a95a-40f2-81cf-d0e9d7143e59][inet[localhost/127.0.0.1:9302]]{data=false},
previous
[Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]},
removed
{[Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],},
reason: zen-disco-master_failed
([Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]])
06.08.2010 13:50:25 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalWarn
WARNING: [Archangel] master should not receive new cluster state from
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]]
06.08.2010 14:29:44 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalWarn
WARNING: [Archangel] master should not receive new cluster state from
[[Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]]
06.08.2010 14:31:21 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalWarn
WARNING: [Archangel] master should not receive new cluster state from
[[Dionysus][bfc8bf74-7356-48c7-abdc-904ec3280e6d][inet[/127.0.0.1:9300]]]
(Curiously, the nodes said goodbye at 13:50 and said hello again ("new
cluster state") 40 minutes later.)
Of course I could not curl anything; the data nodes were not reachable.
So I killed Archangel (the non-dataNode, now being the master) and brought
up a new non-dataNode:
WARNUNG: [Storm, Johnny] received ping response with no matching id [1]
It seems Johnny was also not willing to join my dataNodes.
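(For completeness, such a non-dataNode can be started roughly like this;
a sketch, with node.data=false being the relevant setting:

# start a node that joins the cluster but holds no shards
$ bin/elasticsearch -Des.node.data=false
)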
So I brought one of my two data-nodes down, forcing a snapshot, to save
some data for a later rebuild of the index (I had started the data nodes
with es.index.gateway.snapshot_interval=-1).
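(A gateway snapshot can presumably also be requested explicitly instead of
shutting the node down; a sketch using the gateway snapshot API, assuming
it applies to this setup:

# write the current state of the hbz index to the gateway
$ curl -XPOST 'http://localhost:9200/hbz/_gateway/snapshot'
)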
I killed Johnny (i.e. the master) and brought up a new non-dataNode:
06.08.2010 15:30:25 org.elasticsearch.common.logging.slf4j.Slf4jESLogger
internalInfo
INFO: [Vulcan] detected_master
[Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],
added
{[Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]],},
reason: zen-disco-receive(from
[[Evilhawk][20b6f742-98e7-4d7e-af57-b45d7d2ed4e8][inet[/127.0.0.1:9301]]])
You see, curiously, now the data-node was detected! "status" is fine and the
total hit count is 17,242,864, meaning no documents were lost.
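(To double-check such a count independently of a search, the count API can
be used; a sketch, where the match-all query string is an assumption on my
part:

# count all documents in the hbz index
$ curl -XGET 'http://localhost:9200/hbz/_count?q=*:*'
)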
(I ran into the same problem again later, doing something different: when
recovering that snapshot, starting a non-dataNode first brought "received
ping response with no matching id"; waiting changed nothing, so I killed it
and started a new one => successful connection. It seems to have to do with
the large amount of data (a 48 GB index) and thus a heavy load on the
machine (recovering that snapshot takes 1 hour).)
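Given that, it is probably safer to wait for the cluster to settle before
starting further nodes; a sketch using the cluster health API's
wait_for_status parameter (the timeout value is made up):

# block until at least yellow status, or give up after an hour
$ curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60m'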
A newly "optimize" was done in 0.02ms so it doesnot seem that the first
optimize-run was faulty, and as said, all documents are there.