We have an Elasticsearch 7.1 cluster (AWS) with 2 indexes. The parent index has one shard and two replicas, whereas the child index has 3 shards and 2 replicas. Right now we’re running on 8 data nodes and the shard allocation is as follows:
We’re seeing that some of the nodes (4-8) aren’t participating in search requests. Is this because we have too many nodes? Will this hurt performance?
We also don't really have enough data (35 GB) to justify 3 shards on the child index, but we wanted to leave room for growth. Could this also hurt performance?
The cluster performs OK most of the time, but when there's a spike in traffic, clients experience 504 gateway timeouts for a few minutes. I'm trying to determine whether the cluster configuration contributes to this or whether it's a load balancer issue.
For only 35 GB this strikes me as a very big cluster - but how do you know it's not using nodes 4-8? I assume you checked using explain?
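One quick way to confirm it from the cluster itself is to compare the per-node search counters - a minimal sketch, assuming plain HTTPS access to the domain endpoint (the hostname below is a placeholder):

```
# Per-node search counters; a node that never serves searches keeps query_total at (or near) 0
curl -s "https://your-domain-endpoint/_nodes/stats/indices/search?filter_path=nodes.*.name,nodes.*.indices.search.query_total&pretty"

# And a quick shard-to-node map to see which nodes even hold shard copies
curl -s "https://your-domain-endpoint/_cat/shards?v"
```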
What is your query rate? If queries are very rare, it might just keep using the shard copies it already has data on; I think that selection is maintained at the coordinator level, so if the query goes to various nodes they each keep their own recent history & performance lists (though I've not found it in the code yet).
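If the mechanism I'm thinking of is adaptive replica selection, you can at least confirm whether it's enabled (it defaults to on in 7.x); hostname is again a placeholder:

```
# Look for cluster.routing.use_adaptive_replica_selection in the effective settings
curl -s "https://your-domain-endpoint/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep adaptive_replica
```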
When you get 504s, do you check the Elasticsearch logs to see whether you are hitting timeouts, circuit breakers, or other errors? And does it keep happening like that - you only get 504s for a few minutes, then it works again?
That seems odd, as if the cluster runs out of capacity, but the data is so small and the cluster so large that it's hard to believe - unless you are overloading the coordinating nodes, so I'd suggest watching the queues and CPU there.
Also, what is your heap size on these nodes? If it's only 1 GB or something small, you might just be out of RAM.
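A couple of quick checks along those lines (placeholder hostname again):

```
# Heap and CPU per node
curl -s "https://your-domain-endpoint/_cat/nodes?v&h=name,heap.max,heap.percent,ram.percent,cpu"

# Search thread pool pressure; rejections during a spike usually line up with client-facing errors
curl -s "https://your-domain-endpoint/_cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed"
```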
I'm writing a blog post on how search data flow & selection works, so I'm very interested in this topic.
The AWS console shows 0 for search rate (operations/min) on these nodes.
The search rate has been running at up to 60 operations/min, but the spike we saw was over 3,000 operations/min.
Unfortunately we didn't have the logs enabled as that's not the default setting. We'll look at doing that.
The data nodes never got above 30% CPU, and JVM memory pressure climbed to about 50%. The data node instances have 32 GiB of memory, and it looks like the heap size is 17 GB on the data nodes.
I can share the results of /_nodes/stats if that helps.
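For example, I could pull just the JVM and search thread pool numbers with something like this (our endpoint hostname replaced with a placeholder):

```
# Trimmed node stats: heap usage and search thread pool per node
curl -s "https://your-domain-endpoint/_nodes/stats/jvm,thread_pool?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.thread_pool.search&pretty"
```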
Ah, it's the AWS version. Still, 3,000 queries/min is a lot, especially up from 60 - I'm not sure the cluster can answer that fast, especially if the queries take many seconds: at 5 seconds each, that's 250 concurrent queries.
It certainly has plenty of RAM - 35 GB of data across 8 nodes with roughly 17 GB of heap each - everything should be in RAM and very fast. Though I doubt it can suddenly scale to thousands of queries per minute - do queries go to all nodes, i.e. via load balancing, or do you have separate coordinating nodes?
In Elasticsearch 7.x you can split indices later to handle a reasonable amount of growth, so one approach here would be to change both the parent and child indices to 1 primary shard (assuming the data sets are small and this still meets your query latency requirements) and instead increase the number of replicas so that all nodes hold a full copy of all data - with 8 data nodes that means 7 replicas.
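The replica part is just a settings update once the primary count is sorted out - index name and hostname here are placeholders:

```
# Bump the replica count so every data node can hold a full copy (1 primary + 7 replicas = 8 copies)
curl -s -X PUT "https://your-domain-endpoint/child-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 7}}'
```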
Typically you split indices into more primary shards to improve latency, as large shards can become CPU bound, and increase the number of replica shards to improve query throughput.
If this is hosted on the AWS Elasticsearch service, however, I have no insight into how they tune their clusters nor how they handle zoning and load balancing.
I can easily add more replica shards to the cluster. Changing the number of primary shards would require a shrink or a reindex. Do you think that would be worth the effort?
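For reference, here's roughly what I understand each route to involve - index and node names are placeholders, and I'm not sure every setting here is exposed on the AWS service, so treat this as a sketch:

```
# Shrink route (3 -> 1 shards works because 1 is a factor of 3):
# make the index read-only and force a full copy of every shard onto one node first
curl -s -X PUT "https://your-domain-endpoint/child-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.write": true, "index.routing.allocation.require._name": "node-1"}'

curl -s -X POST "https://your-domain-endpoint/child-index/_shrink/child-index-shrunk" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index.number_of_shards": 1, "index.number_of_replicas": 7}}'

# Reindex route: create a new 1-shard index and copy the documents across
curl -s -X PUT "https://your-domain-endpoint/child-index-v2" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 7}}'

curl -s -X POST "https://your-domain-endpoint/_reindex" \
  -H 'Content-Type: application/json' \
  -d '{"source": {"index": "child-index"}, "dest": {"index": "child-index-v2"}}'
```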