Running into a deadlock on one index

We have been running elasticsearch (0.19.4) on a cluster of 4 machines
for one week, and the cluster mainly runs this kind of indexing job in
one thread:

job 1:

  1. update the index refresh interval to -1
  2. prepare bulk index requests
  3. submit bulk index requests
  4. update the index refresh interval to 1s

and this runs properly.
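
For reference, the REST equivalent of what job 1 does is roughly the
following (our real code uses the Java client; the type name, document
body, and file name below are only illustrative):

  # 1. disable automatic refresh while bulk indexing
  curl -XPUT 'http://localhost:9200/i_product_ys_2/_settings' -d '{ "index" : { "refresh_interval" : "-1" } }'

  # 2.-3. prepare and submit the bulk index requests
  # requests.json holds one action line followed by one document line per record, e.g.
  #   { "index" : { "_index" : "i_product_ys_2", "_type" : "product", "_id" : "1" } }
  #   { "name" : "some product" }
  curl -XPOST 'http://localhost:9200/_bulk' --data-binary @requests.json

  # 4. restore the refresh interval
  curl -XPUT 'http://localhost:9200/i_product_ys_2/_settings' -d '{ "index" : { "refresh_interval" : "1s" } }'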

But recently we added another thread that runs a second indexing job at the same time:

job 2:

  1. refresh the index via client.admin().indices().refresh
  2. use the scan api to find the ids of the records we want
  3. submit the ids found to job 1 for bulk reindexing
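
Roughly, job 2 corresponds to these REST calls (again only a sketch; our
real code uses the Java client, and match_all below stands in for the real
query that selects the records):

  # 1. refresh the index so the scan sees the latest documents
  curl -XPOST 'http://localhost:9200/i_product_ys_2/_refresh'

  # 2. open a scan over the matching records (only the ids are needed)
  curl -XGET 'http://localhost:9200/i_product_ys_2/_search?search_type=scan&scroll=5m&size=100' -d '{ "query" : { "match_all" : {} }, "fields" : [] }'

  # keep pulling batches with the returned scroll id until no hits are left,
  # then hand the collected ids to job 1 for bulk reindexing
  curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m' -d '<scroll_id_from_previous_response>'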

Now we observe (from the stack traces obtained with kill -3) that the
threads are deadlocked on:

  1. "refresh" the index (we also wait forever and never return when we
    refresh the index by curl)
  2. submit bulk index requests
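
The stack traces were captured along these lines (kill -3 makes the JVM
print all thread stacks to its stdout/log; jstack, where available, writes
the same dump to a file; <pid> is the process id of the stuck JVM):

  # dump all thread stacks to the JVM's stdout/log
  kill -3 <pid>

  # or, equivalently, capture the dump to a file
  jstack <pid> > /tmp/threaddump.txt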

I do not see any exception or error in any server log.

Sorry that I may not be able to gist anything useful to reproduce the
problem. But may I ask whether it is ok to run the above 2 jobs
concurrently and repeatedly? I am afraid I am using the api
incorrectly.

Thanks,
Wing

Sorry, I missed some details:

I have tried restarting the server that runs job 1 and job 2, but the
index still seems to be "deadlocked":

  1. cannot refresh (it hangs)

  2. cannot flush; the flush returns this error message:
    {
      "ok" : true,
      "_shards" : {
        "total" : 30,
        "successful" : 29,
        "failed" : 1,
        "failures" : [ {
          "index" : "i_product_ys_2",
          "shard" : 0,
          "reason" : "BroadcastShardOperationFailedException[[i_product_ys_2][0] ]; nested: RemoteTransportException[[Milan][inet[/10.1.4.196:9300]][indices/flush/s]]; nested: FlushNotAllowedEngineException[[i_product_ys_2][0] Already flushing...]; "
        } ]
      }
    }

and the server log shows this exception stack trace:

[2012-08-03 04:02:35,973][DEBUG][action.admin.indices.flush] [Gazer]
[i_product_ys_2][0], node[DhcefRBWRiKwdFrA76pE0A], [R], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.flush.FlushRequest@2fe72d73]
org.elasticsearch.transport.RemoteTransportException:
[Milan][inet[/10.1.4.196:9300]][indices/flush/s]
Caused by: org.elasticsearch.index.engine.FlushNotAllowedEngineException:
[i_product_ys_2][0] Already flushing...
    at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:797)
    at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:508)
    at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:119)
    at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:49)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:373)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

  3. cannot show status
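
For reference, these are the kinds of calls I mean above (index name as in
the flush failure; the index status API is what I refer to as "show status"):

  # refresh: hangs and never returns
  curl -XPOST 'http://localhost:9200/i_product_ys_2/_refresh'

  # flush: returns the "Already flushing..." failure shown above
  curl -XPOST 'http://localhost:9200/i_product_ys_2/_flush'

  # index status and cluster health
  curl -XGET 'http://localhost:9200/i_product_ys_2/_status?pretty=true'
  curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'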

Wing

Sorry, a typo: "cannot show status" should read "can show status".

Apologies if I have made too much noise. Though I still cannot find the
root cause, I have just recovered the index with the following steps:

(The cluster has 4 machines: A (elected as master), B, C, D.)

step 1:
restart B; B can connect to the cluster, and C and D can detect B, but
B cannot join the cluster as it keeps receiving "does not have us
registered with it"; still, the index is deadlocked (i.e. it cannot be
refreshed by curl)

step 2:
stop A; then C is automatically elected as the new master, A can join
the cluster, and recovery starts; now the index seems to be free of the
deadlock (it can be refreshed by curl)

step 3:
start C; C rejoins the cluster, the cluster of 4 machines eventually
recovers by itself, and the index now runs properly

Thanks,
Wing

Very sorry, I have just found that I reported my recovery steps
incorrectly; for reference, here is the corrected recovery:

The cluster has 4 machines: A (elected as master), B, C, D.

step 1:
restart B;
B can connect to the cluster;
C and D can detect B, but B cannot join the cluster as it keeps receiving
"does not have us registered with it";
still, the index is deadlocked (i.e. it cannot be refreshed by curl)

step 2:
stop A;
C is automatically elected as the new master;
B can join the cluster;
recovery starts, and the index now seems to be free of the deadlock (it can
be refreshed by curl)

step 3:
start A;
A rejoins the cluster;
the cluster of 4 machines recovers by itself, and the index now runs
properly
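
For anyone hitting the same problem, commands along these lines can be used
between the steps to see which node is the elected master and whether
recovery has finished (localhost and the index name are illustrative):

  # the "master_node" field of the cluster state shows the elected master
  curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'

  # wait for the cluster to go back to yellow/green after each restart
  curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

  # a simple probe for the "deadlock": this hangs while the index is stuck
  curl -XPOST 'http://localhost:9200/i_product_ys_2/_refresh'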

Regards,
Wing
