Running into deadlock of one index

ywtsang · August 3, 2012, 3:53am

We have been running elasticsearch (0.19.4) on a cluster of 4 machines
for 1 weeks and the cluster is mainly doing this kind of index job by
1 thread:

job 1:

update the index refresh interval to -1
prepare bulk index requests
submit bulk index requests
update the index refresh interval to 1s

and this runs properly.

But recently we added another thread to another index job at the same time:

job 2:

refresh the index by client.admin().indices().refresh
use scan api to find out the ids of the records we want
submit the ids found to job 1 for bulk reindex

Now we observe that the threads are deadlocked on (by finding the
stacktrace using kill -3 ) :

"refresh" the index (we also wait forever and never return when we
refresh the index by curl)
submit bulk index requests

I do not see any exception/error on any server log.

Sorry that I may not be able to gist something useful to reproduce the
problems. But may I ask if it is ok to run the above 2 jobs
concurrently and repeatedly? I am afraid I am using the api
incorrectly.

Thanks,
Wing

ywtsang · August 3, 2012, 4:08am

Sorry to miss more detail:

I have tried to restart the server that run the job 1 and the job 2,
but still, the index seem to be "deadlock":

cannot refresh (will hang)
cannot flush, will show error message:
{
"ok" : true,
"_shards" : {
"total" : 30,
"successful" : 29,
"failed" : 1,
"failures" : [ {
"index" : "i_product_ys_2",
"shard" : 0,
"reason" :
"BroadcastShardOperationFailedException[[i_product_ys_2][0] ]; nested:
RemoteTransportException[[Milan][inet[/10.1.4.196:9300]][indices/flush/s]];
nested: FlushNotAllowedEngineException[[i_product_ys_2][0] Already
flushing...]; "
} ]
}
}

and the server shows this exception stacktrace:

[2012-08-03 04:02:35,973][DEBUG][action.admin.indices.flush] [Gazer]
[i_product_ys_2][0], node[DhcefRBWRiKwdFrA76pE0A], [R], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.flush.FlushRequest@2fe72d73]
org.elasticsearch.transport.RemoteTransportException:
[Milan][inet[/10.1.4.196:9300]][indices/flush/s]
Caused by: org.elasticsearch.index.engine.FlushNotAllowedEngineException:
[i_product_ys_2][0] Already flushing...
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:797)
at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:508)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:119)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:49)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:373)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

cannot show status

Wing

On Fri, Aug 3, 2012 at 11:53 AM, Yiu Wing TSANG ywtsang@gmail.com wrote:

We have been running elasticsearch (0.19.4) on a cluster of 4 machines
for 1 weeks and the cluster is mainly doing this kind of index job by
1 thread:

job 1:

update the index refresh interval to -1

prepare bulk index requests

submit bulk index requests

update the index refresh interval to 1s

and this runs properly.

But recently we added another thread to another index job at the same time:

job 2:

refresh the index by client.admin().indices().refresh

use scan api to find out the ids of the records we want

submit the ids found to job 1 for bulk reindex

Now we observe that the threads are deadlocked on (by finding the
stacktrace using kill -3 ) :

"refresh" the index (we also wait forever and never return when we
refresh the index by curl)

submit bulk index requests

I do not see any exception/error on any server log.

Sorry that I may not be able to gist something useful to reproduce the
problems. But may I ask if it is ok to run the above 2 jobs
concurrently and repeatedly? I am afraid I am using the api
incorrectly.

Thanks,
Wing

ywtsang · August 3, 2012, 6:17am

Sorry typo:

cannot show status
->
can show status

On Fri, Aug 3, 2012 at 12:08 PM, Yiu Wing TSANG ywtsang@gmail.com wrote:

Sorry to miss more detail:

I have tried to restart the server that run the job 1 and the job 2,
but still, the index seem to be "deadlock":

cannot refresh (will hang)

cannot flush, will show error message:
{
"ok" : true,
"_shards" : {
"total" : 30,
"successful" : 29,
"failed" : 1,
"failures" : [ {
"index" : "i_product_ys_2",
"shard" : 0,
"reason" :
"BroadcastShardOperationFailedException[[i_product_ys_2][0] ]; nested:
RemoteTransportException[[Milan][inet[/10.1.4.196:9300]][indices/flush/s]];
nested: FlushNotAllowedEngineException[[i_product_ys_2][0] Already
flushing...]; "
} ]
}
}

and the server shows this exception stacktrace:

[2012-08-03 04:02:35,973][DEBUG][action.admin.indices.flush] [Gazer]
[i_product_ys_2][0], node[DhcefRBWRiKwdFrA76pE0A], [R], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.flush.FlushRequest@2fe72d73]
org.elasticsearch.transport.RemoteTransportException:
[Milan][inet[/10.1.4.196:9300]][indices/flush/s]
Caused by: org.elasticsearch.index.engine.FlushNotAllowedEngineException:
[i_product_ys_2][0] Already flushing...
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:797)
at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:508)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:119)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:49)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:373)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

cannot show status

Wing

On Fri, Aug 3, 2012 at 11:53 AM, Yiu Wing TSANG ywtsang@gmail.com wrote:

We have been running elasticsearch (0.19.4) on a cluster of 4 machines
for 1 weeks and the cluster is mainly doing this kind of index job by
1 thread:

job 1:

update the index refresh interval to -1

prepare bulk index requests

submit bulk index requests

update the index refresh interval to 1s

and this runs properly.

But recently we added another thread to another index job at the same time:

job 2:

refresh the index by client.admin().indices().refresh

use scan api to find out the ids of the records we want

submit the ids found to job 1 for bulk reindex

Now we observe that the threads are deadlocked on (by finding the
stacktrace using kill -3 ) :

"refresh" the index (we also wait forever and never return when we
refresh the index by curl)

submit bulk index requests

I do not see any exception/error on any server log.

Sorry that I may not be able to gist something useful to reproduce the
problems. But may I ask if it is ok to run the above 2 jobs
concurrently and repeatedly? I am afraid I am using the api
incorrectly.

Thanks,
Wing

ywtsang · August 3, 2012, 9:34am

Apologize if I made too much noise. Though I still cannot find the
root cause, I have just recovered the index by the following steps:

(the cluster have 4 machines: A (elected as master), B, C, D

step 1:
restart B, B can connect to the cluster and C and D can detect B but
B cannot join the cluster as it keeps receiving "does not have us
registered with it"; still, the index is at deadlock (i.e. cannot
refresh by curl)

step 2:
stop A, then C is automatically elected as a new master and A can
join the cluster and recovering has started, now the index seem to be
free from deadlock (can refresh by curl)

step 3:
start C and C rejoin the cluster and the cluster of the 4 machines
eventually can recover by itself and the index can run properly now

Thanks,
Wing

On Fri, Aug 3, 2012 at 2:17 PM, Yiu Wing TSANG ywtsang@gmail.com wrote:

Sorry typo:

cannot show status
->

can show status

On Fri, Aug 3, 2012 at 12:08 PM, Yiu Wing TSANG ywtsang@gmail.com wrote:

Sorry to miss more detail:

I have tried to restart the server that run the job 1 and the job 2,
but still, the index seem to be "deadlock":

cannot refresh (will hang)

cannot flush, will show error message:
{
"ok" : true,
"_shards" : {
"total" : 30,
"successful" : 29,
"failed" : 1,
"failures" : [ {
"index" : "i_product_ys_2",
"shard" : 0,
"reason" :
"BroadcastShardOperationFailedException[[i_product_ys_2][0] ]; nested:
RemoteTransportException[[Milan][inet[/10.1.4.196:9300]][indices/flush/s]];
nested: FlushNotAllowedEngineException[[i_product_ys_2][0] Already
flushing...]; "
} ]
}
}

and the server shows this exception stacktrace:

[2012-08-03 04:02:35,973][DEBUG][action.admin.indices.flush] [Gazer]
[i_product_ys_2][0], node[DhcefRBWRiKwdFrA76pE0A], [R], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.flush.FlushRequest@2fe72d73]
org.elasticsearch.transport.RemoteTransportException:
[Milan][inet[/10.1.4.196:9300]][indices/flush/s]
Caused by: org.elasticsearch.index.engine.FlushNotAllowedEngineException:
[i_product_ys_2][0] Already flushing...
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:797)
at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:508)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:119)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:49)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:373)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

cannot show status

Wing

On Fri, Aug 3, 2012 at 11:53 AM, Yiu Wing TSANG ywtsang@gmail.com wrote:

We have been running elasticsearch (0.19.4) on a cluster of 4 machines
for 1 weeks and the cluster is mainly doing this kind of index job by
1 thread:

job 1:

update the index refresh interval to -1

prepare bulk index requests

submit bulk index requests

update the index refresh interval to 1s

and this runs properly.

But recently we added another thread to another index job at the same time:

job 2:

refresh the index by client.admin().indices().refresh

use scan api to find out the ids of the records we want

submit the ids found to job 1 for bulk reindex

Now we observe that the threads are deadlocked on (by finding the
stacktrace using kill -3 ) :

"refresh" the index (we also wait forever and never return when we
refresh the index by curl)

submit bulk index requests

I do not see any exception/error on any server log.

Sorry that I may not be able to gist something useful to reproduce the
problems. But may I ask if it is ok to run the above 2 jobs
concurrently and repeatedly? I am afraid I am using the api
incorrectly.

Thanks,
Wing

ywtsang · August 8, 2012, 12:16pm

Very sorry that I just find my recovery steps reported in a wrong way and
for reference, here is my corrected recovery:

The cluster have 4 machines: A (elected as master), B, C, D

step 1:
restart B,
B can connect to the cluster,
C and D can detect B but B cannot join the cluster as it keeps receiving
"does not have us registered with it",
still, the index is at deadlock (i.e. cannot refresh by curl)

step 2:
stop A,
C is automatically elected as a new master,
B can join the cluster
recovering has started, now the index seem to be free from deadlock (can
refresh by curl)

step 3:
start A
A rejoin the cluster
the cluster of the 4 machines can recover by itself and the index can run
properly now

Regards,
Wing

On Friday, August 3, 2012 5:34:19 PM UTC+8, Yiu Wing TSANG wrote:

Apologize if I made too much noise. Though I still cannot find the
root cause, I have just recovered the index by the following steps:

(the cluster have 4 machines: A (elected as master), B, C, D

step 1:
restart B, B can connect to the cluster and C and D can detect B but
B cannot join the cluster as it keeps receiving "does not have us
registered with it"; still, the index is at deadlock (i.e. cannot
refresh by curl)

step 2:
stop A, then C is automatically elected as a new master and A can
join the cluster and recovering has started, now the index seem to be
free from deadlock (can refresh by curl)

step 3:
start C and C rejoin the cluster and the cluster of the 4 machines
eventually can recover by itself and the index can run properly now

Thanks,
Wing

On Fri, Aug 3, 2012 at 2:17 PM, Yiu Wing TSANG ywtsang@gmail.com wrote:

Sorry typo:

cannot show status
->

can show status

On Fri, Aug 3, 2012 at 12:08 PM, Yiu Wing TSANG ywtsang@gmail.com
wrote:

Sorry to miss more detail:

I have tried to restart the server that run the job 1 and the job 2,
but still, the index seem to be "deadlock":

cannot refresh (will hang)

cannot flush, will show error message:
{
"ok" : true,
"_shards" : {
"total" : 30,
"successful" : 29,
"failed" : 1,
"failures" : [ {
"index" : "i_product_ys_2",
"shard" : 0,
"reason" :
"BroadcastShardOperationFailedException[[i_product_ys_2][0] ]; nested:

RemoteTransportException[[Milan][inet[/10.1.4.196:9300]][indices/flush/s]];

nested: FlushNotAllowedEngineException[[i_product_ys_2][0] Already
flushing...]; "
} ]
}
}

and the server shows this exception stacktrace:

[2012-08-03 04:02:35,973][DEBUG][action.admin.indices.flush] [Gazer]
[i_product_ys_2][0], node[DhcefRBWRiKwdFrA76pE0A], [R], s[STARTED]:
Failed to execute
[org.elasticsearch.action.admin.indices.flush.FlushRequest@2fe72d73]
org.elasticsearch.transport.RemoteTransportException:
[Milan][inet[/10.1.4.196:9300]][indices/flush/s]
Caused by:
org.elasticsearch.index.engine.FlushNotAllowedEngineException:
[i_product_ys_2][0] Already flushing...
at
org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:797)
    at 
org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:508)
    at 
org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:119)
    at 
org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:49)
    at 
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
    at 
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
    at 
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:373)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662) 
cannot show status

Wing

On Fri, Aug 3, 2012 at 11:53 AM, Yiu Wing TSANG ywtsang@gmail.com
wrote:

We have been running elasticsearch (0.19.4) on a cluster of 4 machines
for 1 weeks and the cluster is mainly doing this kind of index job by
1 thread:

job 1:

update the index refresh interval to -1

prepare bulk index requests

submit bulk index requests

update the index refresh interval to 1s

and this runs properly.

But recently we added another thread to another index job at the same
time:

job 2:

refresh the index by client.admin().indices().refresh

use scan api to find out the ids of the records we want

submit the ids found to job 1 for bulk reindex

Now we observe that the threads are deadlocked on (by finding the
stacktrace using kill -3 ) :

"refresh" the index (we also wait forever and never return when we
refresh the index by curl)

submit bulk index requests

I do not see any exception/error on any server log.

Sorry that I may not be able to gist something useful to reproduce the
problems. But may I ask if it is ok to run the above 2 jobs
concurrently and repeatedly? I am afraid I am using the api
incorrectly.

Thanks,
Wing

Topic		Replies	Views
Deadlock in Bulk Ingester using Elastic 8.15.3 in Java Elasticsearch language-clients	3	29	November 12, 2024
Java api client bulk processor locking up, both ES 5 and 2.3 Elasticsearch	3	666	June 27, 2017
BulkProcessor deadlock on cluster failure Elasticsearch	1	1218	January 3, 2018
Deadlock in Elastic Search 0.17.4! Elasticsearch	9	700	July 6, 2017
Bulk update times out Elasticsearch	3	321	July 6, 2017

Running into deadlock of one index

Related topics