How to investigate why Elasticsearch required manual intervention after "failed to obtain in-memory shard lock"

We had a case where one cluster had a flaky network for a short time: nodes were dropped from the cluster, started to rejoin, and were disconnected again several times.
This seems to have left the cluster in a bad state, as some shards could no longer be allocated automatically, and the allocation explain API reported "failed to obtain in-memory shard lock".
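
For reference, the allocation explanation was pulled roughly like this (just a sketch; the host address, index name and shard number are placeholders, not the actual values from the incident):

    # Ask Elasticsearch why a specific shard copy is unassigned.
    # Host, index name and shard number are placeholders.
    import json
    import requests

    ES = "http://localhost:9200"  # any node of the cluster (placeholder)

    resp = requests.get(
        f"{ES}/_cluster/allocation/explain",
        json={"index": "type1-m2018.05", "shard": 1, "primary": False},
    )
    print(json.dumps(resp.json(), indent=2))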

Doing a manual retry_failed reroute did bring the shards back. However, it took quite some time until this was performed, and only about a third of the data is available for this timeframe (either the additional data was never written, or it was removed by recovery when the shards finally synced up again).
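
The manual retry boiled down to the following call (sketch only; the node address is assumed to be the default):

    # Ask the master to retry allocating shards that exceeded the allocation
    # retry limit, e.g. after repeated
    # "failed to obtain in-memory shard lock" failures.
    import requests

    ES = "http://localhost:9200"  # any node of the cluster (placeholder)

    resp = requests.post(f"{ES}/_cluster/reroute", params={"retry_failed": "true"})
    resp.raise_for_status()
    print(resp.json()["acknowledged"])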

Elasticsearch 5.3.3 on AWS EC2 instances, using the EC2 discovery and S3 repository plugins, 3 nodes, minimum_master_nodes=2.
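
For completeness, that setup can be sanity-checked from the outside with something like this (not part of the original investigation; the address is a placeholder):

    # Confirm node count, health, and the minimum_master_nodes setting.
    import requests

    ES = "http://localhost:9200"  # placeholder address

    health = requests.get(f"{ES}/_cluster/health").json()
    print(health["number_of_nodes"], health["status"])

    # Node info returns the settings each node was started with (nested form).
    nodes = requests.get(f"{ES}/_nodes/settings").json()["nodes"]
    for info in nodes.values():
        zen = info["settings"].get("discovery", {}).get("zen", {})
        print(info["name"], "minimum_master_nodes =", zen.get("minimum_master_nodes"))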

The problem seems to be similar to what is discussed in https://github.com/elastic/elasticsearch/issues/23199, where the last comment states that separate issues should be created; however, I am not sure I have enough information to warrant an issue.

Below is a full rundown of the events that were visible in the logs.

Questions that I have:

  • Are there any improvements in newer Elasticsearch versions (5.6.8 or 6.2.3) that would avoid the need for manual intervention?
  • Is there a way to avoid the manual intervention other than setting "index.allocation.max_retries" to a very high value (see the sketch after this list)?
  • Should I report this as an issue for further investigation?
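
To illustrate the max_retries workaround mentioned above, raising the limit on one affected index would look roughly like this (index name and value are placeholders; as far as I understand, the settings update also triggers another allocation attempt as a side effect):

    # Workaround sketch: raise the allocation retry limit on one index.
    # Index name and the value 20 are placeholders.
    import requests

    ES = "http://localhost:9200"  # placeholder address

    resp = requests.put(
        f"{ES}/type1-m2018.05/_settings",
        json={"index.allocation.max_retries": 20},
    )
    print(resp.json())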

Timeline (edited slightly to remove some sensitive information):

node 143

13:02:24,691][WARN ][o.e.a.b.TransportShardBulkAction] [i-node1] [[type3-w2018.18][1]] failed to perform indices:data/write/bulk[s] on replica [type3-w2018.18][1], node[KJ815lGvT2-ipGzAgP3uHA], [R], s[STARTED], a[id=cb27ySMARoawcPJIabKYRQ] org.elasticsearch.transport.NodeDisconnectedException: [i-node2][node2:9300][indices:data/write/bulk[s][r]] disconnected

node 24

13:02:24,589][WARN ][o.e.t.n.Netty4Transport ] [i-node2] exception caught on transport layer [[id: 0x1b6487a3, L:/10.176.250.24:9300 - R:/10.176.252.143:37158]], closing connection
io.netty.handler.codec.DecoderException: java.io.StreamCorruptedException: invalid internal transport message format, got (75,73,79,54)

node 143

13:02:24,699][WARN ][o.e.c.a.s.ShardStateAction] [i-node1] [type3-w2018.18][1] received shard failed for shard id [[type3-w2018.18][1]], allocation id [cb27ySMARoawcPJIabKYRQ], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [type3-w2018.18][1], node[KJ815lGvT2-ipGzAgP3uHA], [R], s[STARTED], a[id=cb27ySMARoawcPJIabKYRQ]], failure [NodeDisconnectedException[[i-node2][node2:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [i-node2][node2:9300][indices:data/write/bulk[s][r]] disconnected

node 143

13:02:24,710][INFO ][o.e.c.r.a.AllocationService] [i-node1] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[type3-w2018.18][1]] ...]).

node 143

13:02:25,662][WARN ][o.e.a.b.TransportShardBulkAction] [i-node1] [[type1-m2018.05][1]] failed to perform indices:data/write/bulk[s] on replica [type1-m2018.05][1], node[KJ815lGvT2-ipGzAgP3uHA], [R], s[STARTED], a[id=uoXrX1iYT5eAIJa2-9jMuw]
org.elasticsearch.transport.NodeNotConnectedException: [i-node2][node2:9300] Node not connected

node 24

13:02:32,179][INFO ][o.e.d.z.ZenDiscovery ] [i-node2] master_left [i-node1], reason [transport disconnected]

node 24

13:02:32,184][WARN ][o.e.d.z.ZenDiscovery ] [i-node2] master left (reason = transport disconnected), current nodes: nodes:
i-node1, master
i-node3
i-node2, local

node 24

13:02:32,644][DEBUG][o.e.d.e.AwsEc2UnicastHostsProvider] [i-node2] using dynamic discovery nodes [i-node1, i-node3, i-node2]

node 143

13:02:50,166][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [i-node1] failed to execute on node [lMcyoREoQ0iezPJlQEC6uA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: i-node3[cluster:monitor/nodes/stats[n]] request_id [109253247] timed out after [15001ms]

node 143

13:02:54,717][WARN ][o.e.c.NodeConnectionsService] [i-node1] failed to connect to node i-node2 (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [i-node2][node2:9300] connect_timeout[30s]

node 24

13:03:25,757][WARN ][o.e.c.NodeConnectionsService] [i-node2] failed to connect to node i-node1 (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [i-node1][10.176.252.143:9300] handshake_timeout[30s]

node 24

13:03:25,771][WARN ][o.e.c.s.ClusterService ] [i-node2] cluster state update task [master_failed (i-node1)] took [53.5s] above the warn threshold of 30s

node 143

May 5 13:03:52 ip-10-176-252-143 kernel: [5112725.308112] Detected Tx Unit Hang

node 209

13:03:53,898][WARN ][o.e.d.z.ZenDiscovery ] [i-node3] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
i-node3, local
i-node1, master
i-node2

Timeline part 2

node 143

13:03:57,790][WARN ][o.e.d.z.ZenDiscovery ] [i-node1] zen-disco-failed-to-publish, current nodes: nodes:
i-node1, local, master
i-node2
i-node3

node 143

13:03:59,797][INFO ][o.e.d.e.AwsEc2UnicastHostsProvider] [i-node1] Exception while retrieving instance list from AWS API: Unable to load credentials from Amazon EC2 metadata service

node 143

May 5 13:03:59 ip-10-176-252-143 kernel: [5112732.193843] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps

node 143

13:04:16,145][WARN ][o.e.d.z.ZenDiscovery ] [i-node1] not enough master nodes, current nodes: nodes:
i-node1, local
i-node2
i-node3

node 209

13:05:26,103][INFO ][o.e.c.s.ClusterService ] [i-node3] detected_master i-node2, reason: zen-disco-receive(from master [master i-node2 committed version [924723]])

node 24

13:05:26,573][INFO ][o.e.c.r.a.AllocationService] [i-node2] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[type3-w2018.18-16][1], [type1-m2018.05][1], [type2-w2018.18][1], [type3-w2018.18][1], [type3-w2018.18-16][1], [type3-w2018.18][1], [type3-w2018.18][1], [type2-w2018.18-16][0], [type2-w2018.18-4][2], [type3-w2018.18][1]] ...]).

node 143

2018-05-05T13:05:32,407][INFO ][o.e.d.z.ZenDiscovery ] [i-node1] master_left [i-node2], reason [transport disconnected]

node 209

13:06:02,908][INFO ][o.e.c.s.ClusterService ] [i-node3] removed {i-node1,}, reason: zen-disco-receive(from master [master i-node2 committed version [924728]])

node 24

13:06:03,919][WARN ][o.e.c.a.s.ShardStateAction] [i-node2] [type2-w2018.18-16][2] received shard failed for shard id [[type2-w2018.18-16][2]], allocation id [HBVZHVz3R0q0rfKFA5e_9Q], primary term [2], message [mark copy as stale]

node 143

May 5 13:07:35 ip-10-176-252-143 kernel: [5112948.732099] Detected Tx Unit Hang

node 24

13:17:39,480][INFO ][o.e.c.s.ClusterService ] [i-node2] added {i-node1,}, reason: zen-disco-node-join[i-node1]

node 143

13:17:39,508][INFO ][o.e.c.s.ClusterService ] [i-node1] detected_master i-node2, reason: zen-disco-receive(from master [master i-node2 committed version [924787]])

node 24

13:17:46,267][WARN ][o.e.c.a.s.ShardStateAction] [i-node2] [type1-m2018.05][1] received shard failed for shard id [[type1-m2018.05][1]], allocation id [k1KOGv8sTemz5sx37LWMzA], primary term [0], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[type1-m2018.05][1]: obtaining shard lock timed out after 5000ms]; ]
java.io.IOException: failed to obtain in-memory shard lock

node 24

13:17:46,411][WARN ][o.e.c.a.s.ShardStateAction] [i-node2] [type1-m2018.05][1] received shard failed for shard id [[type1-m2018.05][1]], allocation id [k1KOGv8sTemz5sx37LWMzA], primary term [0], message [master i-node2 has not removed previously failed shard. resending shard failure]

node 143

13:17:46,266][WARN ][o.e.i.c.IndicesClusterStateService] [i-node1] [[type1-m2018.05][1]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock

node 143

May 5 13:18:03 ip-10-176-252-143 kernel: [5113576.444117] Detected Tx Unit Hang

node 143

13:21:24,349][WARN ][o.e.i.c.IndicesClusterStateService] [i-node1] [[type2-w2018.18-64][2]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [type2-w2018.18-64][2]: Recovery failed from i-node2 into i-node1

node 24

13:24:31,780][WARN ][o.e.c.a.s.ShardStateAction] [i-node2] [type2-w2018.18-64][2] received shard failed for shard id [[type2-w2018.18-64][2]], allocation id [m2_66-FQT0K5c5Ns0PHMEA], primary term [0], message [failed recovery], failure [RecoveryFailedException[[type2-w2018.18-64][2]: Recovery failed from i-node2 into i-node1]; nested: RemoteTransportException[[i-node2][node2:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]];nested: IllegalStateException[try to recover [type2-w2018.18-64][2] from primary shard with sync id but number of docs differ: 255794 (i-node2, primary) vs 255782(i-node1)]; ]
