Problem after rebooting cluster


(Barak Yaish) #1

Hello,

My ES cluster is composed of 2 machines. After indexing 100M docs, I
took the cluster down (ctrl-c) and booted it up again. It seems that each
machine recovered its indices:

[2011-02-03 11:36:43,233][DEBUG][index.shard.service ] [Zarrko, the Tomorrow Man] [8][9] state: [CREATED]
[2011-02-03 11:36:43,235][DEBUG][index.shard.service ] [Zarrko, the Tomorrow Man] [8][8] state: [RECOVERING]->[STARTED], reason [post recovery]
[2011-02-03 11:36:43,235][DEBUG][index.shard.service ] [Zarrko, the Tomorrow Man] [8][8] Scheduling refresher every 1s
[2011-02-03 11:36:43,235][DEBUG][index.gateway ] [Zarrko, the Tomorrow Man] [8][8] recovery completed from local, took [21ms]
    index : files [0] with total_size [0b], took[1ms]
    : recovered_files [0] with total_size [0b]
    : reusing_files [0] with total_size [0b]

But something is wrong. The 2 ES instances did not join each other:

[2011-02-03 11:32:34,764][WARN ][discovery.zen.ping.multicast] [Moondark] received ping response with no matching id [1]
[2011-02-03 11:32:34,764][WARN ][discovery.zen.ping.multicast] [Moondark] received ping response with no matching id [1]
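
One way to confirm whether the two nodes actually formed a single cluster is to ask each of them for its cluster health and compare the node counts. The following is only a minimal sketch: it assumes the `_cluster/health` endpoint is available in this ES version and that its response carries a `number_of_nodes` field. The helper names `node_count` and `check_node` are made up for illustration.

```shell
# Print the "number_of_nodes" value found in a cluster health JSON
# document read from stdin (assumes the field is present in the response).
node_count() {
  sed -n 's/.*"number_of_nodes"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Ask one host for its cluster health, giving up after 10 seconds so
# that a node which never answers does not block the check.
# $1 is the host name; port 9200 is the HTTP port used elsewhere in this post.
check_node() {
  curl -s --max-time 10 -XGET "http://$1:9200/_cluster/health?pretty=true" | node_count
}
```

Running `check_node test1` and `check_node dev1` should print 2 on both hosts if the nodes joined. A 1, or no output at all, would suggest that each node elected itself master or that the request timed out.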

And when I query ES, only one server answers; the other is stuck:

[admin@dev3 ~]$ curl -XGET 'http://test1:9200/1/record/_search?q=runId:uri_69603537&pretty=true'
{
  "took" : 1921,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 14.120542,...

}
[admin@dev3 ~]$ curl -XGET 'http://dev1:9200/1/record/_search?q=runId:uri_69603537&pretty=true''

Running jstack on that server does not show any thread doing work
there.

Can you help me debug this? I tried a few reboots and got the same behavior each time.

Thanks.

