Very sorry, I have just found that my recovery steps were reported in the wrong way. For reference, here is the corrected recovery:
The cluster has 4 machines: A (elected as master), B, C, D.
step 1:
restart B.
B can connect to the cluster, and C and D can detect B, but B cannot join the cluster as it keeps receiving "does not have us registered with it".
Still, the index is deadlocked (i.e. it cannot be refreshed by curl).
step 2:
stop A.
C is automatically elected as the new master, and B can now join the cluster.
Recovery has started, and the index now seems to be free from the deadlock (it can be refreshed by curl).
step 3:
start A.
A rejoins the cluster.
The cluster of 4 machines recovers by itself and the index now runs properly (a sketch of the check I used is below).
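For completeness, this is roughly the check behind "can refresh" above. It is only a sketch: it uses the Java client instead of the curl call I actually ran, the cluster name "elasticsearch" is assumed (we kept the default), and the host is the one that appears in the flush error further down.

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class RecoveryCheck {
    public static void main(String[] args) {
        // Connect to one node of the cluster (cluster name assumed to be the default).
        Client client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "elasticsearch").build())
                .addTransportAddress(new InetSocketTransportAddress("10.1.4.196", 9300));
        try {
            // Wait until the cluster reports at least yellow health.
            client.admin().cluster().prepareHealth().setWaitForYellowStatus()
                    .execute().actionGet();

            // The "can refresh" check: this call hung while the index was deadlocked.
            client.admin().indices().prepareRefresh("i_product_ys_2").execute().actionGet();
            System.out.println("refresh returned, the index is responsive again");
        } finally {
            client.close();
        }
    }
}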
Regards,
Wing
On Friday, August 3, 2012 5:34:19 PM UTC+8, Yiu Wing TSANG wrote:
Apologies if I made too much noise. Though I still cannot find the root cause, I have just recovered the index with the following steps:
(the cluster has 4 machines: A (elected as master), B, C, D)
step 1:
restart B: B can connect to the cluster and C and D can detect B, but B cannot join the cluster as it keeps receiving "does not have us registered with it"; still, the index is deadlocked (i.e. cannot be refreshed by curl)
step 2:
stop A: C is then automatically elected as a new master, A can join the cluster, and recovering has started; now the index seems to be free from the deadlock (can be refreshed by curl)
step 3:
start C: C rejoins the cluster, the cluster of the 4 machines eventually recovers by itself, and the index runs properly now
Thanks,
Wing
On Fri, Aug 3, 2012 at 2:17 PM, Yiu Wing TSANG ywtsang@gmail.com wrote:
Sorry, typo:
- cannot show status
->
- can show status
On Fri, Aug 3, 2012 at 12:08 PM, Yiu Wing TSANG ywtsang@gmail.com
wrote:
Sorry, I missed some details:
I have tried restarting the server that runs job 1 and job 2, but the index still seems to be "deadlocked" (rough Java-client equivalents of these checks are sketched after this list):
- cannot refresh (the call hangs)
- cannot flush; it shows this error message:
{
  "ok" : true,
  "_shards" : {
    "total" : 30,
    "successful" : 29,
    "failed" : 1,
    "failures" : [ {
      "index" : "i_product_ys_2",
      "shard" : 0,
      "reason" : "BroadcastShardOperationFailedException[[i_product_ys_2][0] ]; nested: RemoteTransportException[[Milan][inet[/10.1.4.196:9300]][indices/flush/s]]; nested: FlushNotAllowedEngineException[[i_product_ys_2][0] Already flushing...]; "
    } ]
  }
}
and the server shows this exception stacktrace:
[2012-08-03 04:02:35,973][DEBUG][action.admin.indices.flush] [Gazer] [i_product_ys_2][0], node[DhcefRBWRiKwdFrA76pE0A], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.flush.FlushRequest@2fe72d73]
org.elasticsearch.transport.RemoteTransportException: [Milan][inet[/10.1.4.196:9300]][indices/flush/s]
Caused by: org.elasticsearch.index.engine.FlushNotAllowedEngineException: [i_product_ys_2][0] Already flushing...
    at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:797)
    at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:508)
    at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:119)
    at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:49)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:373)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
- cannot show status
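For reference, the refresh/flush/status checks above were run with curl; through the Java client they would look roughly like the sketch below. This is only an approximation assuming the 0.19-era admin API, and the Client is whatever client the jobs already hold.

import org.elasticsearch.client.Client;

public class IndexChecks {
    // Rough Java-client equivalents of the curl checks listed above.
    public static void run(Client client) {
        // refresh: this is the call that hangs while the index is deadlocked
        client.admin().indices().prepareRefresh("i_product_ys_2").execute().actionGet();

        // flush: this is the call that came back with the "Already flushing..." failure above
        client.admin().indices().prepareFlush("i_product_ys_2").execute().actionGet();

        // status of the index
        client.admin().indices().prepareStatus("i_product_ys_2").execute().actionGet();
    }
}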
Wing
On Fri, Aug 3, 2012 at 11:53 AM, Yiu Wing TSANG ywtsang@gmail.com
wrote:
We have been running elasticsearch (0.19.4) on a cluster of 4 machines for 1 week, and the cluster mainly runs this kind of index job in a single thread:
job 1:
- update the index refresh interval to -1
- prepare bulk index requests
- submit bulk index requests
- update the index refresh interval to 1s
and this runs properly (a rough sketch follows).
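In code, job 1 is roughly the following. It is only a sketch of the calls we make, not the real job: the document type "product" and the docsById map are placeholders, and error handling is stripped out.

import java.util.Map;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class Job1 {
    // One bulk-index cycle: disable refresh, submit the bulk, restore refresh.
    public static void bulkIndex(Client client, String index, Map<String, String> docsById) {
        // update the index refresh interval to -1 (disable automatic refresh)
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.refresh_interval", "-1").build())
                .execute().actionGet();
        try {
            // prepare the bulk index requests (docsById maps id -> json source)
            BulkRequestBuilder bulk = client.prepareBulk();
            for (Map.Entry<String, String> doc : docsById.entrySet()) {
                bulk.add(client.prepareIndex(index, "product", doc.getKey())
                        .setSource(doc.getValue()));
            }
            // submit the bulk index requests
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                System.err.println(response.buildFailureMessage());
            }
        } finally {
            // update the index refresh interval back to 1s
            client.admin().indices().prepareUpdateSettings(index)
                    .setSettings(ImmutableSettings.settingsBuilder()
                            .put("index.refresh_interval", "1s").build())
                    .execute().actionGet();
        }
    }
}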
But recently we added another thread that runs another index job at the same time:
job 2:
- refresh the index by client.admin().indices().refresh
- use the scan api to find the ids of the records we want
- submit the ids found to job 1 for bulk reindex (a rough sketch follows)
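Job 2, again as a sketch only: the query, the scroll timeout and the page size below are made-up values, and I am assuming scan/scroll usage as it works in 0.19.

import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class Job2 {
    // Collects the ids of the records we want to reindex, using the scan api.
    public static List<String> collectIds(Client client, String index) {
        // refresh the index so the scan sees the latest documents
        client.admin().indices().prepareRefresh(index).execute().actionGet();

        // open a scan search (query and sizes are placeholders)
        SearchResponse scan = client.prepareSearch(index)
                .setSearchType(SearchType.SCAN)
                .setScroll(new TimeValue(60000))
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(100)
                .execute().actionGet();

        List<String> ids = new ArrayList<String>();
        while (true) {
            scan = client.prepareSearchScroll(scan.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
            if (scan.getHits().hits().length == 0) {
                break; // no more results
            }
            for (SearchHit hit : scan.getHits()) {
                ids.add(hit.getId());
            }
        }
        // the caller submits these ids to job 1 for bulk reindex
        return ids;
    }
}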
Now we observe (from the stack traces obtained with kill -3) that the threads are deadlocked on:
- "refresh" the index (a refresh by curl also waits forever and never returns)
- submit bulk index requests
I do not see any exception/error in any server log.
Sorry that I may not be able to gist anything useful to reproduce the problem. But may I ask whether it is ok to run the above 2 jobs concurrently and repeatedly? I am afraid I may be using the api incorrectly.
Thanks,
Wing