ES node remained green on VM although underlying disk failed

Hello,

I faced a situation today where ES health checks remained green although one node had definite problems (which caused Logstash to block, etc).

The scenario
All my ES nodes run as libvirt guest VMs. Each VM has a dedicated SSD that is set up as a single LVM volume for the VM to use. One of the SSDs failed, but the VM and the ES node inside it kept running while apparently unable to write new data (I assume, as I could no longer access that VM over SSH).
ES node status remained green on the VM that was using the failed disk.

Has anyone else had a similar problem? Is there any setting to force ES to stop in situations like this? Or is this more of a libvirt issue?

Any suggestions welcome :slight_smile:

Hi @A_B,

Disk failures manifest in many different ways, and sometimes it is hard for an application to tell that anything is really wrong. Do you have logs from Elasticsearch so we can see what this failure looked like from its point of view?

Are you mounting the filesystem with errors=remount-ro or errors=panic? This is a good way to stop an Elasticsearch node when there are signs of trouble.
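
For example, an /etc/fstab entry along these lines would make the kernel panic on the first filesystem error (the device and mount point are placeholders, not taken from your setup, so adjust them to your actual data volume):

# hypothetical /etc/fstab entry for the Elasticsearch data volume
/dev/mapper/main-esdata  /path/to/es-data  ext4  defaults,errors=panic  0  2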

Hi @DavidTurner,

thank you very much for the quick reply :slight_smile:

I will check how the filesystem is mounted.

I have shut down the whole VM, so I am not sure if I can get at the local ES logs, but I can give it a go. I should be able to mount the disk directly from the libvirt host.

The failed drive is not accessible anymore, so those Elasticsearch logs are gone :confused:

On an identical VM I checked the mount config:

/dev/mapper/main-root on / type ext4 (rw,relatime,errors=remount-ro,data=ordered)

On the other nodes in the cluster I see entries like this from when I killed the "bad" VM named es-hay0-18 (as expected):

[2019-04-18T09:24:58,108][INFO ][o.e.c.s.ClusterApplierService] [es-hay0-19] removed {{es-hay0-18}{ym_mc6WZTqaZrDTIweCgjA}{nE59gZFzS8mrsL3ga6qQqQ}{10.0.0.127}{10.0.0.127:9300}{rack_id=br1515, ml.machine_memory=42193956864, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true, gateway=true},}, reason: apply cluster state (from master [master {es-hay0-04}{Gmw2m6AyQ8WN05zXunQfng}{H3lc7EYAQKe0Eb105MaWIw}{10.0.0.113}{10.0.0.113:9300}{rack_id=br1517, ml.machine_memory=42193956864, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true, gateway=true} committed version [35669]])

A little earlier there are three entries like this:

[2019-04-18T08:38:41,220][WARN ][o.e.c.r.a.AllocationService] [es-hay0-04] failing shard [failed shard, shard [dc-telegraf-logs-2019.04.18][3], node[ym_mc6WZTqaZrDTIweCgjA], [R], s[STARTED], a[id=l6EPHvWoQeWhj41WXxDwoQ], message [failed to perform indices:data/write/bulk[s] on replica [dc-telegraf-logs-2019.04.18][3], node[ym_mc6WZTqaZrDTIweCgjA], [R], s[STARTED], a[id=l6EPHvWoQeWhj41WXxDwoQ]], failure [RemoteTransportException[[es-hay0-18][10.0.0.127:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[[dc-telegraf-logs-2019.04.18][3] engine is closed]; nested: FileSystemException[/var/data/es-00/nodes/0/indices/ZjHR1lKVQAW-17SCPVffZA/3/index/_h89.fdx: Read-only file system]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [es-hay0-18][10.0.0.127:9300][indices:data/write/bulk[s][r]]
Caused by: org.apache.lucene.store.AlreadyClosedException: [dc-telegraf-logs-2019.04.18][3] engine is closed
	at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:760) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:769) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:871) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:788) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:755) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:725) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:425) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:393) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:380) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:79) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:637) ~[elasticsearch-6.6.2.jar:6.6.2]

The stack trace continues... I can add more if needed.

Just to be clear, mounting with errors=panic is the right way to stop ES if there are filesystem issues?

This suggests to me that this shard was marked as failed at this point due to a failed write, which should have caused your cluster health to go yellow. I'd also expect the master to try and allocate this shard elsewhere.
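
If you want to double-check what happened from one of the surviving nodes, the standard cluster APIs will show the health and any unassigned shards (localhost:9200 here is just a placeholder for any live node):

# overall cluster health, including the count of unassigned shards
curl -s 'http://localhost:9200/_cluster/health?pretty'

# list shards that are not STARTED (e.g. UNASSIGNED replicas)
curl -s 'http://localhost:9200/_cat/shards?v' | grep -v STARTED

# ask the master why a specific shard is unassigned
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'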

errors=panic will halt the whole OS on a filesystem error, which definitely takes the node down. Arguably this is a reasonable response because it's not great to have an unhealthy node limping along in the cluster.

errors=remount-ro marks the filesystem as readonly, which is less conclusive than a kernel panic. It sort of means you'll still be able to search any shards that are still there as long as Elasticsearch doesn't need to write any more data (indexing, obvs, but also things like flushes or merges). As soon as it tries to write something to a shard it'll get an exception which should then fail that shard. It'll also be unable to persist any cluster state updates, and in 7.0 this means it'll be kicked out of the cluster although I think this is not the case in 6.x.
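
If it's easier than editing /etc/fstab, the default error behaviour of an ext4 filesystem can also be set in its superblock with tune2fs. Note that an explicit errors= mount option overrides this default, and the device path below is just the one from your mount output, so adjust it to the actual data volume:

# set the on-error behaviour stored in the superblock to "panic"
tune2fs -e panic /dev/mapper/main-root

# confirm the current setting
tune2fs -l /dev/mapper/main-root | grep -i 'errors behavior'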

The cluster did go yellow and there was one shard that was orphaned or left unallocated.

Thank you very much for the info. I will change the mounting options :+1:

Correction: today this only happens for master-eligible nodes, but that might change in future too.
