(EDIT) Can Elasticsearch Scale to 1 Lakh (100,000) QPS (Queries Per Second)?


(NM) #1

Hi, All.

We are using a 3-node (master+data) cluster with an AWS Load Balancer (Hardware Machine) in front of it.

Nodes Configuration:

  • RAM: 32 GB (50% to ES and the remaining 50% to the OS)
  • Cores: 16 for each node
  • Shards: 3
  • Replicas: 1
  • Index: 1
  • Queries: Basic Queries (No wildcard, aggregations etc.)
  • Analyzers: Yes (4)
  • Tokenizers: Yes (2)
  • Filter: Yes (2)
  • ngrams: Yes (both front and back ngrams are used)
  • Total Data-Size: < 200 MB (Very Small and it will always be small)

With this setup we can serve 2,500 QPS (queries per second). We now want to reach 1 lakh (100,000) QPS.

So my questions are:

  1. Do we need to increase the number of replicas to achieve our goal?
  2. Do we need to increase the number of shards from 3 to 12 to achieve our goal?
  3. Do we need to do both 1 and 2 to achieve our goal?
  4. Do we need to increase the number of nodes from 3 to 12 to achieve our goal?
  5. Do we need to drop the AWS LB and use the ES built-in load balancer (client nodes) instead? Our topology would then become a 12-node cluster with 2 client nodes, 7 master nodes and the rest acting as data nodes. Would that achieve our goal?

Please don't say "try everything and figure it out yourself", because I cannot do that for certain reasons. If anyone has faced and/or solved this problem, please help me understand and solve this scaling problem.

Let's Learn And Grow Together.
Cheers
Thanks.


(Mark Walkom) #2

You won't get a single solution here, because your use case, your data and your requirements are different from everyone else's. You will need to try some of this yourself.

Are you seeing problems with query rates over 1K/s? What does the load look like on your current setup and use?


(Christian Dahlqvist) #3

Query throughput generally scales with the number of replicas, as more work can be done in parallel. As your data set is small enough to easily be cached in the file system cache, I would recommend increasing the number of replicas so that all nodes hold all data. Once this is done, you should be able to ensure that the node receiving the request serves it from its local data and avoids network traffic by specifying the "_local" preference in your queries.
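As a rough illustration of the suggestion above, here is a minimal Python sketch that only builds the two requests involved: an index settings update that sets `number_of_replicas` so every data node holds a copy of every shard, and a search path that passes `preference=_local`. The index name `products` and the helper function names are hypothetical, and the sketch assumes every node in the cluster is a data node. It does not send anything to a cluster.

```python
import json
from typing import Tuple


def replicas_for_full_copy(data_nodes: int) -> int:
    """Replica count needed so each data node can hold one copy of every shard.

    One copy is the primary, so the remaining (data_nodes - 1) are replicas.
    """
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    return data_nodes - 1


def replica_settings_request(index: str, data_nodes: int) -> Tuple[str, str, str]:
    """Build (method, path, body) for the index settings update."""
    body = json.dumps(
        {"index": {"number_of_replicas": replicas_for_full_copy(data_nodes)}}
    )
    return "PUT", f"/{index}/_settings", body


def local_search_path(index: str) -> str:
    """Search path asking the receiving node to prefer its local shard copies."""
    return f"/{index}/_search?preference=_local"


if __name__ == "__main__":
    # "products" is a placeholder index name, not taken from this thread.
    print(replica_settings_request("products", 3))
    print(local_search_path("products"))
```

For the 3-node cluster described in the first post this gives 2 replicas, so each of the 3 shards has one copy on every node.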

Once this is done I would recommend running benchmarks to see what is limiting performance and query throughput. If it e.g. turns out to be CPU, you may consider moving to a different instance type with more CPU cores.
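A benchmark of this kind could be sketched roughly as follows. This is a generic throughput harness, not an Elasticsearch-specific tool: `run_query` is a hypothetical callable that you would replace with a function issuing one representative search against your cluster, and the worker count is an assumption to tune. At high request rates the Python client itself may become the bottleneck, so treat the numbers as a lower bound and watch CPU on both sides.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_qps(run_query: Callable[[], object], total_queries: int, workers: int) -> float:
    """Run `run_query` total_queries times across `workers` threads and
    return the achieved queries per second."""
    if total_queries < 1 or workers < 1:
        raise ValueError("total_queries and workers must be at least 1")
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_query) for _ in range(total_queries)]
        for future in futures:
            future.result()  # re-raises any exception from a failed query
    elapsed = time.perf_counter() - start
    return total_queries / elapsed


if __name__ == "__main__":
    # Placeholder workload; swap in a real search call to benchmark a node.
    def fake_query() -> None:
        time.sleep(0.001)

    print(f"{measure_qps(fake_query, total_queries=200, workers=8):.0f} QPS")
```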

Further increasing the number of nodes that can serve the data will also increase throughput, so adding additional data nodes is likely to bring more benefit than adding client nodes. Since all nodes hold all data it should scale almost linearly.


(NM) #4

@Christian_Dahlqvist Thanks for the answer. The problem here is that people want exact numbers based on some analysis. I cannot simply go and change the config every now and then (like changing the number of nodes to 5, then 10 and so on), because there are a lot of dependencies. So I think there should be some alternative where, based on the data I have provided above (now edited with the latest values), we can build a near-perfect calculation which says we should go with n nodes, m shards and o replicas to achieve approximately 1 lakh QPS.

If not clear please ask.
Thanks


(Mark Walkom) #5

You won't get that level of an answer here I'm sorry to say.


(Christian Dahlqvist) #6

There are too many parameters and settings involved that affect performance, making it impossible to build any kind of analytical model that can be used to accurately determine required cluster size. Benchmarking is therefore almost always required.

As your data set is small and all data can easily be held in memory on every node, you should be able to benchmark on a small cluster, possibly even a single node, that holds all data. You can then optimise the performance and try to identify the most cost-effective instance type for your use case. Once you know how many QPS each node can handle, you should be able to use this to estimate the number of nodes required reasonably accurately.
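The final estimation step can be written down as simple arithmetic. The sketch below assumes throughput scales linearly with node count (which only holds approximately, and only when every node holds all data) and uses a hypothetical helper name, `estimate_nodes`. It uses integer arithmetic to avoid floating-point rounding at the boundary.

```python
def estimate_nodes(target_qps: int, measured_qps: int, measured_nodes: int = 1) -> int:
    """Estimate nodes needed for target_qps, assuming linear scaling from a
    benchmark that reached measured_qps on measured_nodes nodes."""
    if target_qps <= 0 or measured_qps <= 0 or measured_nodes <= 0:
        raise ValueError("all arguments must be positive")
    # Ceiling division: nodes = ceil(target_qps * measured_nodes / measured_qps)
    return -(-(target_qps * measured_nodes) // measured_qps)


if __name__ == "__main__":
    # Figures from the first post: 2,500 QPS on 3 nodes, target 100,000 QPS.
    print(estimate_nodes(100_000, 2_500, 3))
```

With the untuned figures from the first post (2,500 QPS on 3 nodes) this gives 120 nodes for 100,000 QPS, which is why optimising and measuring per-node throughput first matters: any improvement there reduces the estimate proportionally.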

As your data is so small you may also try increasing your heap size beyond 50% as you will be able to utilise very little operating system file cache due to the size of your data.


(NM) #7

@Christian_Dahlqvist cool, will try it out.

