Indexing breaking for a few seconds/minutes

I've started to notice that my indexing rate has some breaks, which leads to delays in the indexing of documents.

What can be the cause of this?

This could mean that your node is getting overloaded; those gaps in monitoring usually mean that the Elasticsearch node was not able to respond to the monitoring requests.

I suggest that you check the logs for any WARN lines.
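
As a rough way to confirm back-pressure, something like the sketch below (assuming an unsecured cluster reachable at localhost:9200; adjust the URL, auth and TLS to your setup) polls the write thread pool stats. A growing queue or any rejections usually means the node cannot keep up with indexing.

import time

import requests  # helper script run outside Elasticsearch, not part of it

ES_URL = "http://localhost:9200"  # assumption: unsecured local cluster

while True:
    stats = requests.get(f"{ES_URL}/_nodes/stats/thread_pool").json()
    for node in stats["nodes"].values():
        write_pool = node["thread_pool"]["write"]
        # A growing queue or non-zero rejections points to indexing back-pressure.
        print(f'{node["name"]}: write queue={write_pool["queue"]} '
              f'rejected={write_pool["rejected"]}')
    time.sleep(10)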

Thanks @leandrojmp

I'm getting some logs like this:

[2025-01-16T01:54:54,226][ERROR][o.e.x.m.c.n.NodeStatsCollector] [suricata-elk02] collector [node_stats] timed out when collecting data: node [fe481yc6SjOs2wUC5LhlgA] did not respond within [10s]
[2025-01-16T01:55:06,525][WARN ][o.e.t.TransportService   ] [suricata-elk02] Received response for a request that has timed out, sent [22.2s/22209ms] ago, timed out [12.2s/12205ms] ago, action [cluster:monitor/nodes/stats[n]], node [{suricata-elk02}{fe481yc6SjOs2wUC5LhlgA}{7rxPVRv7RmGScElf3EQ4tw}{suricata-elk02}{10.61.15.56}{10.61.15.56:9300}{dim}{8.15.1}{7000099-8512000}{xpack.installed=true, transform.config_version=10.0.0, ml.config_version=12.0.0}], id [2332523]
[2025-01-16T01:55:12,380][WARN ][o.e.g.PersistedClusterStateService] [suricata-elk02] writing cluster state took [21088ms] which is above the warn threshold of [10s]; [skipped writing] global metadata, wrote [0] new mappings, removed [0] mappings and skipped [113] unchanged mappings, wrote metadata for [0] new indices and [1] existing indices, removed metadata for [0] indices and skipped [201] unchanged indices
[2025-01-16T02:20:04,440][WARN ][o.e.g.PersistedClusterStateService] [suricata-elk02] writing cluster state took [13205ms] which is above the warn threshold of [10s]; [skipped writing] global metadata, wrote [0] new mappings, removed [0] mappings and skipped [113] unchanged mappings, wrote metadata for [0] new indices and [1] existing indices, removed metadata for [0] indices and skipped [201] unchanged indices
[2025-01-16T02:24:34,248][ERROR][o.e.x.m.c.n.NodeStatsCollector] [suricata-elk02] collector [node_stats] timed out when collecting data: node [fe481yc6SjOs2wUC5LhlgA] did not respond within [10s]
[2025-01-16T02:24:34,747][WARN ][o.e.t.TransportService   ] [suricata-elk02] Received response for a request that has timed out, sent [10.6s/10604ms] ago, timed out [600ms/600ms] ago, action [cluster:monitor/nodes/stats[n]], node [{suricata-elk02}{fe481yc6SjOs2wUC5LhlgA}{7rxPVRv7RmGScElf3EQ4tw}{suricata-elk02}{10.61.15.56}{10.61.15.56:9300}{dim}{8.15.1}{7000099-8512000}{xpack.installed=true, transform.config_version=10.0.0, ml.config_version=12.0.0}], id [2361597]

Yeah, those lines clearly say that your node is overloaded.

[2025-01-16T01:54:54,226][ERROR][o.e.x.m.c.n.NodeStatsCollector] [suricata-elk02] collector [node_stats] timed out when collecting data: node [fe481yc6SjOs2wUC5LhlgA] did not respond within [10s]

[2025-01-16T01:55:12,380][WARN ][o.e.g.PersistedClusterStateService] [suricata-elk02] writing cluster state took [21088ms] which is above the warn threshold of [10s]; [skipped writing] global metadata, wrote [0] new mappings, removed [0] mappings and skipped [113] unchanged mappings, wrote metadata for [0] new indices and [1] existing indices, removed metadata for [0] indices and skipped [201] unchanged indices

How many nodes do you have in your cluster? What are their specs? You may need to increase them.
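
To get a quick answer to that, a sketch like this (same assumption of an unsecured cluster at localhost:9200) prints a per-node snapshot of CPU, heap, load and disk usage from the _cat/nodes API, which you can compare against the times of the gaps:

import requests

ES_URL = "http://localhost:9200"  # assumption: unsecured local cluster

resp = requests.get(
    f"{ES_URL}/_cat/nodes",
    params={"v": "true", "h": "name,cpu,heap.percent,load_1m,disk.used_percent"},
)
# One row per node: name, CPU %, heap %, 1-minute load average, disk used %.
print(resp.text)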

I have 3 nodes. Two of them have an NFS mount that Elasticsearch is writing to. Could NFS be the issue here?

What are the specs of the nodes? CPU, RAM, configured HEAP?

I would say that it could be an issue. While using NFS as the data path works, I don't think it is recommended, as it can add latency and other issues; the recommendation is always to use local disks.

What is the disk type of the server exporting the NFS share?
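
To see which nodes actually have their data path on NFS, a small sketch like this (again assuming an unsecured cluster at localhost:9200) lists each node's data path, mount point and filesystem type from the node stats:

import requests

ES_URL = "http://localhost:9200"  # assumption: unsecured local cluster

stats = requests.get(f"{ES_URL}/_nodes/stats/fs").json()
for node in stats["nodes"].values():
    for data_path in node["fs"]["data"]:
        # Filesystem type such as ext4, xfs or nfs4 shows where the data really lives.
        print(f'{node["name"]}: {data_path["path"]} '
              f'({data_path.get("type", "unknown")} on {data_path["mount"]})')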

Node 1: 12 CPUs, 20GB RAM with 10GB heap
Node 2: 8 CPUs, 15GB RAM with 7GB heap
Node 3: 8 CPUs, 15GB RAM with 7GB heap

I'm not able to answer the NFS disk question, but I think it's SSDs.

Is there any setting I can tune in Elasticsearch to help?

In general, you really don’t want to be using NFS.

For a start, problems become even more difficult to troubleshoot, as they can be caused by some entirely unrelated issue impacting your NFS / network performance. Plus all the other NFS complexities: which NFS version, whether Kerberos is used, what the rsize/wsize values are, locking, TCP/UDP, …
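
If you do stick with NFS for now, a quick sketch like this, run directly on each node (Linux only), dumps the NFS mount options so you at least know which NFS version, rsize/wsize and transport protocol you are dealing with:

# Run on the Elasticsearch nodes themselves; no external dependencies.
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options, *_ = line.split()
        if fstype.startswith("nfs"):
            # Options include vers=, rsize=, wsize= and proto=tcp/udp.
            print(f"{mountpoint} ({device}): {options}")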

I once managed an Elasticsearch cluster based on NFS on top of Ceph. It looked like a good idea on paper.