Experiencing a delay in Elasticsearch index logs

We are experiencing a delay in logs being ingested into Elasticsearch. While investigating, we found the following error in the Elasticsearch logs.
Logs are being ingested from the application via Filebeat, processed through Logstash, and then sent to Elasticsearch.

[2024-10-23 18:58:54,970][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [yball1v355ca09] collector [cluster_stats] timed out when collecting data: node [OA1DKLJoT4yfNNW-p9kkJg] did not respond within [10s]

Kindly let me know how this can be resolved.

Thanks,
Shaista

Hi Shaista, a delay like this can occur for multiple reasons.

  1. Could you share your cluster configuration (per node) and the number of indices and shards?
  2. What is the hardware utilization on each node? Are the nodes fully utilized?
  3. Could you share the output of `_cluster/stats?human&pretty` (see the example below)?
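For reference, a minimal sketch of how to pull that output, assuming direct access to a node on localhost:9200 (adjust the host and add authentication if your cluster is secured):

```bash
# Cluster-wide statistics: node counts, shard counts, heap and disk usage
curl -s "http://localhost:9200/_cluster/stats?human&pretty"
```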

The error message you’re seeing indicates that the monitoring collector timed out while waiting for a node to respond when collecting cluster statistics. That usually means the node is under pressure, which can also slow down log ingestion. Here are some steps to help you troubleshoot and resolve this issue:

1. Check Node Health

  • Use the _cat/nodes API to check the health of your Elasticsearch nodes:

```bash
GET /_cat/nodes?v
```
  • Ensure that all nodes appear in the output, and check the cluster health to confirm the status is not yellow or red (see the example below).
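A quick way to see the overall cluster status, again assuming local access on port 9200:

```bash
# Cluster status (green/yellow/red), node count, and unassigned shard count
curl -s "http://localhost:9200/_cluster/health?pretty"
```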

2. Monitor Resource Usage

  • Check the resource usage (CPU, memory, disk I/O) on the Elasticsearch nodes. High usage could lead to timeouts.
  • Use the following command to view the node stats:

```bash
GET /_nodes/stats
```
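If the full output is too verbose, you can limit it to the metric groups most relevant to timeouts. A small example, assuming local access:

```bash
# OS (CPU/load), JVM (heap, GC) and filesystem stats per node
curl -s "http://localhost:9200/_nodes/stats/os,jvm,fs?human&pretty"
```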

3. Increase Timeout Settings

  • If the timeout is caused by temporary load spikes, consider increasing the timeout for the cluster-stats monitoring collector; the default is 10s, which matches the error above. Note that raising it only masks the symptom if a node is genuinely overloaded. The relevant setting is:

```yaml
xpack.monitoring.collection.cluster.stats.timeout: 30s
```
  • You can add this to your elasticsearch.yml and restart the node, but since it is a dynamic cluster setting you can also apply it at runtime via the cluster settings API (see the example below).
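A sketch of applying it without a restart, assuming local access and that you are using the collection-based monitoring that this collector belongs to:

```bash
# Raise the cluster_stats collection timeout at runtime (persists across restarts)
curl -s -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"xpack.monitoring.collection.cluster.stats.timeout": "30s"}}'
```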

4. Check Logstash Performance

  • Since logs are being processed through Logstash, monitor its performance and resource usage. If Logstash is slow, it can back up the log ingestion pipeline.
  • Ensure that Logstash is configured correctly and that it can handle the volume of logs being ingested.
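One way to check whether Logstash is the bottleneck is its monitoring API, assuming the default API binding on localhost:9600:

```bash
# Per-pipeline event counts, durations, and queue back-pressure indicators
curl -s "http://localhost:9600/_node/stats/pipelines?pretty"
```

If events are arriving faster than they are flushed, raising pipeline.workers or pipeline.batch.size in logstash.yml may help, provided the nodes have spare CPU.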

5. Review Filebeat Configuration

  • Check the Filebeat configuration for any issues that could be causing delays. Ensure that it is configured to send logs efficiently.
  • Consider increasing the bulk_max_size setting in Filebeat to optimize the volume of data sent to Logstash.
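A minimal sketch of the relevant part of filebeat.yml, assuming Filebeat ships to Logstash over the Beats input (host and values are illustrative, not your actual configuration):

```yaml
output.logstash:
  hosts: ["logstash-host:5044"]   # placeholder host
  bulk_max_size: 2048             # events per batch sent to Logstash
  worker: 2                       # parallel workers per configured host
```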

6. Cluster Configuration

  • Review your Elasticsearch cluster configuration for any potential bottlenecks, such as insufficient node resources or improper shard allocation.
  • Consider adjusting the number of shards or replicas if the index is heavily loaded.
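To spot oversized shards or uneven allocation across nodes, the cat shards API is useful (assuming local access):

```bash
# Shards sorted by on-disk size, largest first
curl -s "http://localhost:9200/_cat/shards?v&s=store:desc&h=index,shard,prirep,state,store,node"
```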

7. Logs Analysis

  • Examine the Elasticsearch logs for any additional error messages or warnings that might provide further context on the issue.
  • Look for logs indicating garbage collection or other resource-related issues.
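For example, long or frequent garbage-collection pauses show up as overhead warnings in the server log. A quick grep, assuming the default log path of a package install:

```bash
# Look for GC overhead warnings and collector timeouts in the Elasticsearch logs
grep -iE "gc.*overhead|timed out" /var/log/elasticsearch/*.log | tail -n 50
```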

8. Upgrade Considerations

  • If you are running an older version of Elasticsearch or the Elastic stack components, consider upgrading to the latest stable version to benefit from performance improvements and bug fixes.