Elasticsearch Indexing Performance Degradation

I've been encountering indexing performance degradation in my Elasticsearch cluster and I'm seeking advice on how to troubleshoot and resolve it.

Scenario:

I have a moderately sized Elasticsearch cluster consisting of three data nodes, each running on separate physical servers. The cluster is set up with a replication factor of 1 and a single shard per index. The cluster is primarily used for storing and querying logs generated by various applications within our infrastructure.

Recently, I've noticed a significant slowdown in indexing performance. The indexing rate used to be satisfactory, but it has now dropped noticeably. This slowdown is causing delays in data availability for querying, which is impacting our operational efficiency.

Investigation:

To troubleshoot this issue, I've already performed the following steps (a sketch of the API calls for steps 1-3 follows the list):

  1. Checked the cluster health using the _cluster/health API endpoint. The cluster health is reported as green, indicating that all primary and replica shards are allocated and the cluster is in good health.
  2. Reviewed the indexing throughput metrics using the _stats API endpoint. While indexing rates were previously high, they have now dropped below acceptable levels.
  3. Examined the indexing thread pools using the _nodes/stats/thread_pool API endpoint. The thread pools seem to be underutilized, indicating that the slowdown may not be due to resource constraints.
  4. Reviewed the cluster and node logs for any error messages or warnings that might indicate underlying issues. However, I did not find any relevant errors or warnings.
  5. Monitored system resource utilization (CPU, memory, disk I/O) on each data node using system monitoring tools. There were no significant spikes or anomalies in resource usage during the period of indexing slowdown.
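
For reference, here is roughly how I am running checks 1-3 above. This is just a minimal sketch using Python's requests library against one of the nodes; the host, port, and authentication are placeholders for our actual setup:

import requests

ES = "http://localhost:9200"  # placeholder for one of our data nodes

# 1. Cluster health: reports "green" when all primary and replica shards are allocated
health = requests.get(f"{ES}/_cluster/health").json()
print("cluster status:", health["status"])

# 2. Indexing stats: cumulative counters; sampling index_total twice over a known
#    interval gives the effective docs/sec rate
stats = requests.get(f"{ES}/_stats/indexing").json()
indexing = stats["_all"]["primaries"]["indexing"]
print("docs indexed:", indexing["index_total"],
      "- time spent indexing (ms):", indexing["index_time_in_millis"])

# 3. Write thread pool per node: a growing queue or rejections would point at
#    Elasticsearch-side saturation; low active counts suggest it is underutilized
tp = requests.get(f"{ES}/_nodes/stats/thread_pool").json()
for node in tp["nodes"].values():
    write_pool = node["thread_pool"]["write"]
    print(node["name"], "write pool -> active:", write_pool["active"],
          "queue:", write_pool["queue"], "rejected:", write_pool["rejected"])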

Despite these investigations, I'm unable to pinpoint the exact cause of the indexing performance degradation. I suspect there might be some underlying configuration issues or bottlenecks that I'm overlooking.

Request for Assistance:

I'm seeking advice and recommendations from the community on how to further diagnose and address this indexing performance issue. Any insights, best practices, or suggestions for optimizing indexing performance in Elasticsearch would be greatly appreciated.

Thank you in advance for your assistance!

What is the full output of the cluster stats API?

How many indices are you actively indexing into?

Are you letting Elasticsearch assign document IDs or are you using custom IDs?

What is the hardware specification of the cluster? What type of storage are you using?

What was the indexing rate previously? What is it now?

How are you indexing data into the cluster? Filebeat? Logstash? Some other method?

Has anything changed lately that could cause this, e.g. new data sources being added?

Was the slowdown just temporary? If so, how long did it last and when did it occur?

Thank you for your prompt response and for your questions. Here are the additional details you requested:

  1. Full Output of Cluster Stats API:
{
  "cluster_name": "your_cluster_name",
  "status": "green",
  "indices": {
    "count": 10,
    "shards": {
      "total": 10,
      "primaries": 5,
      "replication": 0.0,
      "index": {
        "shards": {
          "min": 1,
          "max": 1,
          "avg": 1.0
        },
        "replication": {
          "min": 0.0,
          "max": 1.0,
          "avg": 0.0
        }
      }
    }
  },
  "nodes": {
    "count": {
      "total": 3,
      "data": 3,
      "coordinating_only": 0,
      "master": 3,
      "ingest": 3
    },
    "versions": [
      "7.10.2"
    ],
    "os": {
      "available_processors": 24,
      "allocated_processors": 24,
      "names": [
        {
          "name": "Linux",
          "count": 3
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "CentOS Linux 8 (Core)",
          "count": 3
        }
      ],
      "mem": {
        "total": "94.4gb",
        "total_in_bytes": 101416069120,
        "free": "36.4gb",
        "free_in_bytes": 39145157632,
        "used": "58.0gb",
        "used_in_bytes": 62270711488,
        "free_percent": 39,
        "used_percent": 61
      }
    },
    "process": {
      "cpu": {
        "percent": 25
      },
      "open_file_descriptors": {
        "min": 713,
        "max": 775,
        "avg": 755
      }
    },
    "jvm": {
      "max_uptime": "3.2d",
      "max_uptime_in_millis": 287258667,
      "versions": [
        {
          "version": "11.0.10",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "11.0.10+9",
          "vm_vendor": "AdoptOpenJDK",
          "bundled_jdk": true,
          "using_bundled_jdk": true,
          "count": 3
        }
      ],
      "mem": {
        "heap_used": "17.5gb",
        "heap_used_in_bytes": 18782936384,
        "heap_max": "46.4gb",
        "heap_max_in_bytes": 49821687808
      },
      "threads": 442
    },
    "fs": {
      "total": "931.4gb",
      "total_in_bytes": 999678717184,
      "free": "834.3gb",
      "free_in_bytes": 895463274752,
      "available": "792.9gb",
      "available_in_bytes": 851874129152
    },
    "plugins": [],
    "network_types": {
      "transport_types": {
        "security4": 3
      },
      "http_types": {
        "security4": 3
      }
    }
  }
}
  2. Number of Actively Indexing Indices: We are actively indexing into 10 indices.
  3. Document IDs: We are letting Elasticsearch assign document IDs.
  4. Hardware Specification of Cluster: Each server has the following specifications:
  • CPU: Intel Xeon Gold 6248 CPU @ 2.50GHz (12 cores)
  • RAM: 32 GB
  • Storage: NVMe SSD (1 TB)
  5. Previous Indexing Rate: Previously, our indexing rate was around 2000 documents per second.
  6. Current Indexing Rate: Currently, the indexing rate has dropped to around 500 documents per second.
  7. Method of Indexing Data: We are using Logstash for indexing data into the cluster.
  8. Recent Changes: There haven't been any significant changes lately that could directly cause this slowdown. We did not add any new data sources, and there were no changes in the configuration of existing data sources.
  9. Duration of Slowdown: The slowdown has persisted for the past week and has not shown signs of improvement.

Thank you for your assistance. If you need any further information or if there are additional steps we should take to troubleshoot this issue, please let us know.

You are using a very old version that has been EOL for a long time, so I would recommend that you upgrade. That said, I do not see anything in the stats or your description indicating that Elasticsearch is actually the limiting factor when it comes to indexing performance.

To check whether Elasticsearch is the bottleneck, I would recommend adding a separate indexing job that indexes into a test index, and seeing whether it performs well and increases the load on the cluster without affecting the ingest of the current data. You could create a large file with data and index it using a simple Logstash config.
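
As a rough starting point, here is a sketch of such a test job. I have used the Python client's bulk helper rather than Logstash purely because it is quick to put together; the host, index name, document shape and volume are placeholders, so adapt them to your environment:

from datetime import datetime, timezone

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def generate_docs(count):
    # Synthetic log-like documents; no _id is set, so Elasticsearch assigns
    # auto IDs, matching your current setup.
    for i in range(count):
        yield {
            "_index": "indexing-test",  # throwaway test index
            "_source": {
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "message": f"synthetic log line {i}",
                "level": "INFO",
            },
        }

start = datetime.now(timezone.utc)
success, errors = bulk(es, generate_docs(100_000), chunk_size=1000)
elapsed = (datetime.now(timezone.utc) - start).total_seconds()
print(f"indexed {success} docs in {elapsed:.1f}s "
      f"({success / elapsed:.0f} docs/s), errors: {errors}")

If this sustains well above the ~500 docs/s you are currently seeing while your regular pipeline stays slow, that points away from Elasticsearch and towards the ingest side.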

Where does the data Logstash is ingesting come from? Are there any sources that could be limiting throughput? Are you using any potentially expensive plugins? One thing I have seen in the past is that JDBC inputs running queries against a relational database get slower and slower over time as the amount of data in the RDBMS increases. Plugins that call out to other systems can also be a source of problems.
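
One quick way to see where time goes on the Logstash side is its own monitoring API (by default on port 9600), which exposes per-plugin event counts and durations. A minimal sketch, assuming the API is reachable on the Logstash host with default settings; pipeline and plugin names will of course differ in your setup:

import requests

LS = "http://localhost:9600"  # Logstash monitoring API, default port

stats = requests.get(f"{LS}/_node/stats/pipelines").json()
for pipeline_name, pipeline in stats["pipelines"].items():
    print("pipeline:", pipeline_name)
    for stage in ("inputs", "filters", "outputs"):
        for plugin in pipeline["plugins"].get(stage, []):
            events = plugin.get("events", {})
            # Inputs report how many events they produced; filters and outputs
            # also report how long they spent processing them.
            print(f"  {stage[:-1]} {plugin.get('name')}: "
                  f"in={events.get('in')} out={events.get('out')} "
                  f"duration_ms={events.get('duration_in_millis')}")

A filter or output whose duration keeps growing out of proportion to its event count is a good candidate for the bottleneck; if the input's event count is low to begin with, the limit is upstream of Logstash.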

Thank you for your response and recommendations.

Regarding the Elasticsearch version, we understand the importance of keeping our systems up to date and we will definitely consider upgrading to a newer version. However, at the moment, we would like to focus on troubleshooting the current performance issue before proceeding with any major upgrades.

Your suggestion to create a separate indexing job for a test index is a good idea. We will implement this and monitor its performance to determine if Elasticsearch is indeed the bottleneck. Additionally, we will create a large dataset and index it using a simple Logstash configuration to simulate increased load on the cluster.

As for the data Logstash is ingesting, it primarily comes from various sources such as application logs, system metrics, and network traffic logs. We haven't identified any specific sources that could be limiting throughput, but we will investigate further to ensure there are no bottlenecks at the data source level. Regarding plugins, we are not using any potentially expensive plugins that could significantly impact performance. However, we will review our plugin usage to confirm this.

We appreciate your insights and will proceed with these steps to diagnose and address the slowdown in indexing performance. If you have any additional suggestions or if there are specific areas we should focus on during our investigation, please let us know. Thank you again for your assistance. :slightly_smiling_face: :upside_down_face:

Although I do not necessarily think it applies to your use case, one thing that can cause indexing slowdown is frequent cluster state updates due to dynamic mappings continuously being added or new indices being created. Do any of your indices have very large mappings as a result of dynamic mapping?
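
One way to check is to pull the mappings and count fields per index; a very large or steadily growing field count is a sign that dynamic mapping keeps adding fields and therefore keeps updating the cluster state. A rough sketch (the host is a placeholder):

import requests

ES = "http://localhost:9200"  # placeholder host

def count_fields(properties):
    # Recursively count leaf fields, including objects and multi-fields.
    total = 0
    for field in properties.values():
        if "properties" in field:
            total += count_fields(field["properties"])
        else:
            total += 1 + len(field.get("fields", {}))  # e.g. .keyword sub-fields
    return total

mappings = requests.get(f"{ES}/_mapping").json()
for index, body in sorted(mappings.items()):
    props = body.get("mappings", {}).get("properties", {})
    print(f"{index}: {count_fields(props)} fields")

If any index shows thousands of fields, or the numbers keep climbing between runs, dynamic mapping is likely generating frequent mapping updates (the default index.mapping.total_fields.limit is 1000, so you would probably start seeing errors around that point anyway).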
