Single Elasticsearch node abruptly ran out of disk space (possibly due to a giant temp file)

Hi. I have recently observed that one of our Elasticsearch nodes ran out of disk space and fell out of the cluster. Upon investigating it, I found several things that are very, very weird about this situation:

  1. I haven't found a single mention of the low or high disk watermarks in the logs - the node goes straight to flood stage and then completely runs out of space soon after.
  2. When I tried to free some space to restart the node by moving index directories off the filesystem (not recommended, I know), the node freed a lot of space upon starting - enough to put the directories back.
  3. In this state, the node started successfully and managed to join the cluster. However, after restarting the service, one shard remained unassigned, citing a lack of disk space for creating a temporary file. Here is the allocation explain result (the request I used is shown after the output):
shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [POST /_cluster/reroute?retry_failed&metric=none] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-06-03T20:16:59.062Z], failed_attempts[5], failed_nodes[[kLEPW32STiKcs3gyG3fcKg]], delayed=false, last_node[kLEPW32STiKcs3gyG3fcKg], details[failed shard on node [kLEPW32STiKcs3gyG3fcKg]: failed to create shard, failure java.io.IOException: failed to write state to the first location tmp file /elastic_db/indices/hnq2U8bfR2WsE30Y43VrKw/0/state-1.st.tm
Caused by: java.io.IOException: No space left on device
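
As mentioned above, the request I used looked roughly like this - the index name is just a placeholder for the affected index, "primary" may need to be false if the unassigned copy was a replica, and calling the API with an empty body also works (it picks an arbitrary unassigned shard):

GET _cluster/allocation/explain
{
  "index": "my-affected-index",   // placeholder, not the real index name
  "shard": 0,
  "primary": true
}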

For reference, there was 281 GB of free disk space at the time, and the corresponding index was 174 GB - there should have been more than enough space. I ended up deleting the index to put the cluster back together.
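
For anyone cross-checking this kind of thing: the cat APIs show what Elasticsearch itself sees on each node's disk (as opposed to what df reports), e.g.:

GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.total,disk.percent
GET _cat/nodes?v&h=name,disk.used_percent,disk.avail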

My conclusion is that an enormous temp file came out of nowhere, single-handedly brought the node to its knees, and then prevented it from initializing the corresponding index later on. This seems to have happened on several occasions, all confined to that particular node. Has anyone encountered this problem? Any idea how it may have come about?
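
If it happens again, something like this should show whether a single oversized file is really to blame (plain shell; /elastic_db is the data path from the logs below, and the 10 GB threshold is an arbitrary choice):

# largest directories under the data path
du -xh --max-depth=3 /elastic_db | sort -rh | head -20
# any single file over 10 GB, temp files included
find /elastic_db -xdev -type f -size +10G -exec ls -lh {} \;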

Here are the log excerpts:

The log preceding the problem (there are no other entries for that day):

[2023-06-02T03:08:14,188][INFO ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][308226] overhead, spent [260ms] collecting in the last [1s]
[2023-06-02T03:08:27,628][INFO ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][308238] overhead, spent [426ms] collecting in the last [1s]
[2023-06-02T03:08:28,949][INFO ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][young][308239][6167] duration [996ms], collections [1]/[1.3s], total [996ms]/[2.8m], memory [2.6gb]->[2.4gb]/[15gb], all_pools {[young] [176mb]->[0b]/[0b]}{[old] [1.9gb]->[2.4gb]/[15gb]}{[survivor] [456mb]->[48mb]/[0b]}
[2023-06-02T03:08:28,949][WARN ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][308239] overhead, spent [996ms] collecting in the last [1.3s]
[2023-06-02T12:10:24,772][INFO ][o.e.m.j.JvmGcMonitorService] [data-node-1] [gc][340724] overhead, spent [254ms] collecting in the last [1s]

Next-day log, flood-stage watermark alert (literally the first log entry):

[2023-06-03T02:21:24,687][WARN ][o.e.x.m.e.l.LocalExporter] [data-node-1] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: org.elasticsearch.cluster.block.ClusterBlockException: index [.monitoring-es-7-2023.06.02] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];
[2023-06-03T02:21:24,838][WARN ][o.e.x.m.MonitoringService] [data-node-1] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
...
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
        at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:124) ~[?:?]
[2023-06-03T02:21:34,645][WARN ][o.e.x.m.e.l.LocalExporter] [data-node-1] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: org.elasticsearch.cluster.block.ClusterBlockException: index [.monitoring-es-7-2023.06.02] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];

First out-of-space alert:

[2023-06-04T00:00:04,968][WARN ][o.e.i.s.RetentionLeaseSyncAction] [data-node-1] [.apm-agent-configuration][0] retention lease background sync failed
org.elasticsearch.transport.RemoteTransportException: [data-node-1][192.168.121.67:9300][indices:admin/seq_no/retention_lease_background_sync[p]]
Caused by: org.elasticsearch.gateway.WriteStateException: failed to write state to the first location tmp file /elastic_db/indices/wjO6qs01Ri29KSHjzOChIg/0/retention-leases-20.st.tmp
...
Caused by: java.io.IOException: No space left on device
[2023-06-04T00:00:05,292][WARN ][o.e.i.s.RetentionLeaseSyncAction] [data-node-1] [.kibana-event-log-8.5.2-000002][0] retention lease background sync failed
org.elasticsearch.transport.RemoteTransportException: [data-node-1][192.168.121.67:9300][indices:admin/seq_no/retention_lease_background_sync[p]]
Caused by: org.elasticsearch.gateway.WriteStateException: failed to write state to the first location tmp file /elastic_db/indices/LInEr-9XQDiNfp4rAXVGOQ/0/retention-leases-199.st.tmp
...
Caused by: java.io.IOException: No space left on device

Node failure:

[2023-06-04T00:00:48,928][WARN ][o.e.x.m.MonitoringService] [data-node-1] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
...
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]
        ... 31 more
Caused by: org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
        at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:177) ~[elasticsearch-8.8.0.jar:?]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:638) ~[elasticsearch-8.8.0.jar:?]
        ... 8 more
...
[2023-06-04T00:00:49,640][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data-node-1] this node is unhealthy: health check failed on [/elastic_db]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.8/discovery-troubleshooting.html
[2023-06-04T00:00:59,642][WARN ][o.e.c.c.ClusterFormationFailureHelper] [data-node-1] this node is unhealthy: health check failed on [/elastic_db]; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.8/discovery-troubleshooting.html
[2023-06-04T00:01:00,662][WARN ][r.suppressed             ] [data-node-1] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];

Almost forgot - we are currently using ES 8.8.0, but it seems like the problem also took place before we upgraded from 8.5.2.

Did you try the advice in the allocation explain message first?

manually call [POST /_cluster/reroute?retry_failed&metric=none] to retry

I think that would have been enough.
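
That is, once there is free space again, something along these lines, and then watch the shard recover:

POST /_cluster/reroute?retry_failed&metric=none
GET _cat/recovery?v&active_only=true&h=index,shard,stage,files_percent,bytes_percent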

These are all from the data node. I think you need to look in the master node logs.

Your temp-file theory is possible, but Elasticsearch doesn't create humongous temp files as a matter of course, so are you sure it's not something else? The only way I could see this happening is if you force-merged one or more enormous shards.

I've only just now realized that shard allocation (and retrying it) happens at the cluster level, not the node level, so restarting the node alone would not retry the allocation. I guess you're right - I should have done that.

Indeed, the watermark warnings are there in the master node logs - not sure why they wouldn't also show up in the data node logs, but I guess that answers the question of how the problem occurred. Thanks for your help (I do feel silly now).

One thing I still don't understand is why the problem consistently takes place on that particular node, but I guess I'll look into log ingestion and data relocation.
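
Probably starting with what actually lives on that node and how big it is, e.g.:

GET _cat/shards?v&h=index,shard,prirep,state,store,node&s=store:desc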

The data node doesn't actually know about the disk watermarks; they're all computed on the master. And even if it did, it still wouldn't know whether the problem would be solved by moving shards around, which is what determines how much noise the disk usage makes in the logs.
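
If you want to see the thresholds the master is applying, the effective settings (defaults included) are visible with something like:

GET _cluster/settings?include_defaults=true&flat_settings=true

and then looking for the cluster.routing.allocation.disk.watermark.* keys in the output.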

Gotcha, thank you.
