ES node remained green on VM although underlying disk failed

Hello,

I faced a situation today where ES health checks remained green although one node had definite problems (which caused Logstash to block, etc).

The scenario
All my ES nodes run as libvirt guest VMs. Each VM has a dedicated SSD that is set up as a single LVM volume for the VM to use. One of the SSDs failed, but the VM and the ES node inside it kept running while apparently unable to write new data (I assume, as I could no longer access that VM over SSH).
ES node status remained green on the VM that was using the failed disk.

Has anyone else had a similar problem? Is there any setting to force ES to stop in situations like this? Or is this more of a libvirt issue?

Any suggestions welcome :slight_smile:

Hi @A_B,

Disk failures manifest in many different ways, and sometimes it is hard for an application to tell that anything is really wrong. Do you have logs from Elasticsearch so we can see what this failure looked like from its point of view?

Are you mounting the filesystem with errors=remount-ro or errors=panic? This is a good way to stop an Elasticsearch node when there are signs of trouble.
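
For example, an /etc/fstab entry along these lines would make the kernel panic on the first filesystem error (the device and mount point are placeholders, not taken from your setup, so adjust them to your actual data volume):

# hypothetical /etc/fstab entry for the Elasticsearch data volume
/dev/mapper/main-esdata  /path/to/es-data  ext4  defaults,errors=panic  0  2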

Hi @DavidTurner,

thank you very much for the quick reply :slight_smile:

I will check how the filesystem is mounted.

I have shut down the whole VM, so I am not sure if I can get at the local ES logs, but I can give it a go. I should be able to mount the disk directly from the libvirt host.

The failed drive is not accessible anymore, so those Elasticsearch logs are gone :confused:

On an identical VM I checked the mount config:

/dev/mapper/main-root on / type ext4 (rw,relatime,errors=remount-ro,data=ordered)

On the other nodes in the cluster I see entries like this from when I killed the "bad" VM named es-hay0-18 (as expected):

[2019-04-18T09:24:58,108][INFO ][o.e.c.s.ClusterApplierService] [es-hay0-19] removed {{es-hay0-18}{ym_mc6WZTqaZrDTIweCgjA}{nE59gZFzS8mrsL3ga6qQqQ}{10.0.0.127}{10.0.0.127:9300}{rack_id=br1515, ml.machine_memory=42193956864, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true, gateway=true},}, reason: apply cluster state (from master [master {es-hay0-04}{Gmw2m6AyQ8WN05zXunQfng}{H3lc7EYAQKe0Eb105MaWIw}{10.0.0.113}{10.0.0.113:9300}{rack_id=br1517, ml.machine_memory=42193956864, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true, gateway=true} committed version [35669]])

A little earlier there are three entries like this:

[2019-04-18T08:38:41,220][WARN ][o.e.c.r.a.AllocationService] [es-hay0-04] failing shard [failed shard, shard [dc-telegraf-logs-2019.04.18][3], node[ym_mc6WZTqaZrDTIweCgjA], [R], s[STARTED], a[id=l6EPHvWoQeWhj41WXxDwoQ], message [failed to perform indices:data/write/bulk[s] on replica [dc-telegraf-logs-2019.04.18][3], node[ym_mc6WZTqaZrDTIweCgjA], [R], s[STARTED], a[id=l6EPHvWoQeWhj41WXxDwoQ]], failure [RemoteTransportException[[es-hay0-18][10.0.0.127:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[[dc-telegraf-logs-2019.04.18][3] engine is closed]; nested: FileSystemException[/var/data/es-00/nodes/0/indices/ZjHR1lKVQAW-17SCPVffZA/3/index/_h89.fdx: Read-only file system]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [es-hay0-18][10.0.0.127:9300][indices:data/write/bulk[s][r]]
Caused by: org.apache.lucene.store.AlreadyClosedException: [dc-telegraf-logs-2019.04.18][3] engine is closed
	at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:760) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:769) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:871) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:788) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:755) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:725) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:425) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:393) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:380) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:79) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:637) ~[elasticsearch-6.6.2.jar:6.6.2]

The stack trace continues... I can add more if needed.

Just to be clear, mounting with errors=panic is the right way to stop ES if there are filesystem issues?

This suggests to me that this shard was marked as failed at this point due to a failed write, which should have caused your cluster health to go yellow. I'd also expect the master to try and allocate this shard elsewhere.
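
If you want to double-check what happened from one of the surviving nodes, the standard cluster APIs will show the health and any unassigned shards (localhost:9200 here is just a placeholder for any live node):

# overall cluster health, including the count of unassigned shards
curl -s 'http://localhost:9200/_cluster/health?pretty'

# list shards that are not STARTED (e.g. UNASSIGNED replicas)
curl -s 'http://localhost:9200/_cat/shards?v' | grep -v STARTED

# ask the master why a specific shard is unassigned
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'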

errors=panic will halt the whole OS on a filesystem error, which definitely takes the node down. Arguably this is a reasonable response because it's not great to have an unhealthy node limping along in the cluster.

errors=remount-ro marks the filesystem as readonly, which is less conclusive than a kernel panic. It sort of means you'll still be able to search any shards that are still there as long as Elasticsearch doesn't need to write any more data (indexing, obvs, but also things like flushes or merges). As soon as it tries to write something to a shard it'll get an exception which should then fail that shard. It'll also be unable to persist any cluster state updates, and in 7.0 this means it'll be kicked out of the cluster although I think this is not the case in 6.x.
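
If it's easier than editing /etc/fstab, the default error behaviour of an ext4 filesystem can also be set in its superblock with tune2fs. Note that an explicit errors= mount option overrides this default, and the device path below is just the one from your mount output, so adjust it to the actual data volume:

# set the on-error behaviour stored in the superblock to "panic"
tune2fs -e panic /dev/mapper/main-root

# confirm the current setting
tune2fs -l /dev/mapper/main-root | grep -i 'errors behavior'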

The cluster did go yellow and there was one shard that was orphaned or left unallocated.

Thank you very much for the info. I will change the mounting options :+1:

Correction: today this only happens for master-eligible nodes, but that might change in future too.
