Hi. I recently observed that one of our Elasticsearch nodes ran out of disk space and fell out of the cluster. While investigating, I found several things about this situation that strike me as very odd:
- I haven't found a single mention of the low or high disk watermarks in the logs - the node goes straight to flood stage and then runs out of space completely soon after.
- When I tried to free some space so I could restart the node by moving index directories off the filesystem (not recommended, I know), the node itself freed a lot of space on startup - enough to move the directories back.
- In this state, the node started successfully and managed to join the cluster. However, after restarting the service, one shard remained unassigned, citing a lack of disk space for creating a temporary file. Here is the allocation explain output (the explain and retry calls are sketched right after the excerpt):
shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed&metric=none] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-06-03T20:16:59.062Z], failed_attempts[5], failed_nodes[[kLEPW32STiKcs3gyG3fcKg]], delayed=false, last_node[kLEPW32STiKcs3gyG3fcKg], details[failed shard on node [kLEPW32STiKcs3gyG3fcKg]: failed to create shard, failure java.io.IOException: failed to write state to the first location tmp file /elastic_db/indices/hnq2U8bfR2WsE30Y43VrKw/0/state-1.st.tm
Caused by: java.io.IOException: No space left on device
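For reference, these are the standard calls involved here - the allocation explain that produced the message above, the retry it suggests, and a settings query to double-check the watermark thresholds from the first bullet (Kibana Dev Tools syntax; the index name and shard number are placeholders rather than my actual index):

GET _cluster/allocation/explain
{ "index": "my-index", "shard": 0, "primary": true }

POST /_cluster/reroute?retry_failed&metric=none

# look for cluster.routing.allocation.disk.watermark.* in the response
GET _cluster/settings?include_defaults=true&flat_settings=true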
For reference, at the time there was 281 GB of free disk space, while the corresponding index was 174 GB - there should have been more than enough room. I ended up deleting the index to put the cluster back together.
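If it helps to cross-check those numbers, the per-node disk figures and index store sizes can be pulled with the _cat APIs, along these lines:

GET _cat/allocation?v
GET _cat/indices?v&h=index,pri,rep,pri.store.size,store.size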
My conclusion is that an enormous temp file came out of nowhere, single-handedly brought the node to its knees, and later prevented it from initializing the corresponding index. This seems to have happened on several occasions, all confined to that particular node. Has anyone encountered this problem? Any idea how it may have come about?
Here are the log excerpts:
The log entries preceding the problem (there are no other entries for that day):
[2023-06-02T03:08:14,188][INFO ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][308226] overhead, spent [260ms] collecting in the last [1s]
[2023-06-02T03:08:27,628][INFO ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][308238] overhead, spent [426ms] collecting in the last [1s]
[2023-06-02T03:08:28,949][INFO ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][young][308239][6167] duration [996ms], collections [1]/[1.3s], total [996ms]/[2.8m], memory [2.6gb]->[2.4gb]/[15gb], all_pools {[young] [176mb]->[0b]/[0b]}{[old] [1.9gb]->[2.4gb]/[15gb]}{[survivor] [456mb]->[48mb]/[0b]}
[2023-06-02T03:08:28,949][WARN ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][308239] overhead, spent [996ms] collecting in the last [1.3s]
[2023-06-02T12:10:24,772][INFO ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][340724] overhead, spent [254ms] collecting in the last [1s]
Next-day log, flood-stage watermark alert (literally the first log entry):
[2023-06-03T02:21:24,687][WARN ][o.e.x.m.e.l.LocalExporter] [data-node-1] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: org.elasticsearch.cluster.block.ClusterBlockException: index [.monitoring-es-7-2023.06.02] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];
[2023-06-03T02:21:24,838][WARN ][o.e.x.m.MonitoringService] [data-node-1] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
...
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:124) ~[?:?]
[2023-06-03T02:21:34,645][WARN ][o.e.x.m.e.l.LocalExporter] [data-node-1] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: org.elasticsearch.cluster.block.ClusterBlockException: index [.monitoring-es-7-2023.06.02] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];
First out-of-space alert:
[2023-06-04T00:00:04,968][WARN ][o.e.i.s.RetentionLeaseSyncAction] [data-node-1] [.apm-agent-configuration][0] retention lease background sync failed
org.elasticsearch.transport.RemoteTransportException: [data-node-1][192.168.121.67:9300][indices:admin/seq_no/retention_lease_background_sync[p]]
Caused by: org.elasticsearch.gateway.WriteStateException: failed to write state to the first location tmp file /elastic_db/indices/wjO6qs01Ri29KSHjzOChIg/0/retention-leases-20.st.tmp
...
Caused by: java.io.IOException: No space left on device
[2023-06-04T00:00:05,292][WARN ][o.e.i.s.RetentionLeaseSyncAction] [data-node-1] [.kibana-event-log-8.5.2-000002][0] retention lease background sync failed
org.elasticsearch.transport.RemoteTransportException: [data-node-1][192.168.121.67:9300][indices:admin/seq_no/retention_lease_background_sync[p]]
Caused by: org.elasticsearch.gateway.WriteStateException: failed to write state to the first location tmp file /elastic_db/indices/LInEr-9XQDiNfp4rAXVGOQ/0/retention-leases-199.st.tmp
...
Caused by: java.io.IOException: No space left on device
Node failure:
[2023-06-04T00:00:48,928][WARN ][o.e.x.m.MonitoringService] [data-node-1] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
...
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]
... 31 more
Caused by: org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:177) ~[elasticsearch-8.8.0.jar:?]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:638) ~[elasticsearch-8.8.0.jar:?]
... 8 more
...
[2023-06-04T00:00:49,640][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data-node-1] this node is unhealthy: health check failed on [/elastic_db]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.8/discovery-troubleshooting.html
[2023-06-04T00:00:59,642][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data-node-1] this node is unhealthy: health check failed on [/elastic_db]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.8/discovery-troubleshooting.html
[2023-06-04T00:01:00,662][WARN ][r.suppressed ] [data-node-1] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];