Cluster stopped ingesting, with "failed to obtain in-memory shard lock"

My cluster logged some "failed to obtain in-memory shard lock" messages over a period of the night, finishing at around 04:33 this morning, and there is no data in most of the indexes after 04:33, i.e. it has stopped indexing data. (There's just one, very sparsely used, index with data in it past that time.)

All shards are showing as STARTED. retry_failed didn't do anything. Restarting each node in the cluster didn't do anything.
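For the record, by retry_failed I mean the retry flag on the cluster reroute API; the call I ran was along the lines of:

POST /_cluster/reroute?retry_failed=true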

What else do I need to look at? How do I get my cluster indexing again?

The reason for a problem in the middle of the night may have been that I was reindexing hundreds of gigabytes of data, and at some point in the process one or two of the nodes might have run short of disk space for relocating shards. There is currently no shortage of disk space on any node.

Data should be but isn't coming in from Logstash, from Metricbeat and from some Python scripts. The index that is still being written to comes from a Java application.

Ah, and then I find the following in the Python logs. I'd better now find out what a "read-only index" is.

{
	u'update': {
		u'status': 403,
		u'_type': u'doc',
		u'_index': u'event-2018.07',
		u'error': {
			u'reason': u'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];',
			u'type': u'cluster_block_exception'
		},
		u'_id': u'et3-tim-2.imagiro.ltd_threshold_Datapushgroupcount_2018-07-02T13:18:02.265Z',
		u'data': {
			'doc': {
				'alerted-level': 'critical',
				'alerted': True
			}
		}
	}
}

This would explain the message blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]: if a node exceeds the flood-stage disk watermark (95% of disk capacity by default) then all indices with shards on that node are marked as read-only. The documentation on disk-based shard allocation describes this in more detail and also describes how to recover.
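To confirm that this is what happened, something like the following should show per-node disk usage and the disk-based allocation settings currently in effect (look for the cluster.routing.allocation.disk.* values in the second response):

GET /_cat/allocation?v
GET /_cluster/settings?include_defaults=true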

This may or may not be related to the "failed to obtain in-memory shard lock" message, but it sounds like this is your actual problem here.


And having got that clue from the Python application logs, the fix (after several hours' research) was:

PUT /_all/_settings
{
  "index": {
    "blocks.read_only_allow_delete": null
  }
}
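
For anyone else hitting this: make sure the disk space has actually been freed first, otherwise Elasticsearch will reapply the block the next time it checks the watermarks. To verify that no blocks are left you can look at the blocks section of the cluster state, for example:

GET /_cluster/state/blocks

An empty blocks object in the response means no read-only blocks remain.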
