Elastic cluster data nodes increased search latency

We've provisioned an Elasticsearch cluster on the AWS OpenSearch Service, and we have a single major index with replication set to 3.

Recently, we've noticed a significant spike in search request serving times, sometimes reaching up to 2 seconds. Upon investigation, we found that a couple of data nodes were experiencing increased search latency. After restarting these nodes, they returned to normal behavior.

Here are a few observations regarding the affected data nodes:

  • They also experienced increased JVMGCYoungCollectionCount and JVMGCYoungCollectionTime.
  • These data nodes utilize AWS EBS gp2 (SSD) disks, and it appears they were running out of IOPS credits available from EBS.

Although restarting the data nodes resolved the issue and they began to function normally again, we're puzzled about how a restart could have addressed the underlying problem. It's worth noting that the ES cluster continued to serve the same traffic during this time.

ES version: 7.1

OpenSearch/OpenDistro are AWS-run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance. See "What is OpenSearch and the OpenSearch Dashboard? | Elastic" for more details.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

Hello and Welcome,

If you are using OpenSearch, then you need to ask this on an OpenSearch forum.

OpenSearch contains custom code, written mostly by AWS.

Our cluster's engine is Elasticsearch 7.1, hosted on the AWS managed service (i.e. OpenSearch Service). That's the reason I posted it here.

As far as I know, AWS runs Elasticsearch with custom plugins, so it is not standard Elasticsearch.

This type of storage (gp2) can have very limited IOPS, especially if the volumes are small, and can quickly become a bottleneck. The volumes are, however, able to burst to higher IOPS for a short period of time, so it may be that you have hit this limit and the restart reset the bursting calculation. Upgrading to gp3 storage is probably a good idea.
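To put rough numbers on the gp2 limits mentioned above, here is a minimal sketch of the burst arithmetic. It assumes the published gp2 figures (3 IOPS per GiB baseline with a 100 IOPS floor and a 16,000 IOPS ceiling, bursting up to 3,000 IOPS, and a full bucket of 5.4 million I/O credits). The function names are made up for illustration, and the volume size in the example is hypothetical since the original post does not state one.

```python
# Published gp2 figures, treated here as assumptions; check the current AWS docs.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP2_BURST_IOPS = 3_000
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full burst bucket


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a gp2 volume, clamped to the floor and ceiling."""
    return min(max(GP2_IOPS_PER_GIB * size_gib, GP2_MIN_IOPS), GP2_MAX_IOPS)


def gp2_burst_seconds(size_gib: int, demand_iops: int = GP2_BURST_IOPS):
    """Seconds a full credit bucket lasts at the given sustained demand.

    Returns None when the demand does not exceed the baseline, i.e. when
    the volume never drains its credits.
    """
    baseline = gp2_baseline_iops(size_gib)
    demand = min(demand_iops, GP2_BURST_IOPS)  # gp2 cannot burst above 3,000
    drain_per_second = demand - baseline
    if drain_per_second <= 0:
        return None
    return GP2_CREDIT_BUCKET / drain_per_second


if __name__ == "__main__":
    size = 100  # hypothetical volume size in GiB
    print(gp2_baseline_iops(size), gp2_burst_seconds(size))
```

With these assumptions, a small volume sustaining the full burst rate drains its credits in well under an hour, after which it is held to its baseline. That would match the pattern of nodes slowing down under unchanged traffic.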

BTW did you look at Cloud by Elastic, also available if needed from AWS Marketplace, Azure Marketplace and Google Cloud Marketplace?

Cloud by Elastic is one way to have access to all features, all managed by us. Think about what is already there, such as Security, Monitoring, Reporting, SQL, Canvas, Maps UI, Alerting, and the built-in solutions named Observability, Security, and Enterprise Search, and what is coming next :slight_smile: ...