Large Capacity Sizing: Single Cluster vs Multiple Clusters, Index Sizing, Memory Problems

Hiya all...

Some quick scaling questions to gather best practices. We deal with a very large pipeline of events every second and are doing another round of evaluation now that we have moved to ES6. I will share our current setup below; I would love to get thoughts and recommendations.

Average Event Size: 1.8 kB
Typical Events Per Second: 300k EPS (projected to reach 1 million EPS by the end of next year)
3 racks of 12 servers each, split by rack into separate clusters.
Each Server:
  • 56 CPUs
  • 768 GB memory
  • 12 spinning disks split into 6 RAID0 volumes
  • 6 ES data instances
  • CPU pinning giving each instance 9 CPUs, with 2 CPUs left dedicated to system operations
  • Running the Elasticsearch Docker image on CentOS 7
  • Supports 30 days' worth of data at 100k EPS per rack (rough storage math sketched below)
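
For reference, a rough back-of-the-envelope check on that 30-day retention figure, assuming the 1.8 kB average event size above and ignoring compression and indexing overhead (a sketch, not our actual capacity model):

```python
# Rough storage estimate for one rack: 30 days of retention at 100k EPS,
# assuming the 1.8 kB average event size quoted above. Ignores compression
# and index overhead, so treat it as a ballpark only.
eps = 100_000              # events per second per rack
event_size_kb = 1.8        # average raw event size in kB
retention_days = 30

raw_tb = eps * event_size_kb * 86_400 * retention_days / 1e9  # kB -> TB (decimal)
with_replica_tb = raw_tb * 2                                  # primary + 1 replica

print(f"raw primaries: ~{raw_tb:.0f} TB, with 1 replica: ~{with_replica_tb:.0f} TB")
```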

Each Cluster and Index Level:
5 Dedicated Coordinator Nodes
5 Dedicated Master Nodes
72 Data Nodes

Each index is time-bucketed at 3 hours with 24 shards / 1 replica (shards are 30-50 GB).
Each cluster runs around 25k shards and 150 billion documents.
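
For context, the 3-hour buckets come from an index template along these lines (a minimal sketch using Python's requests library; the template name, index pattern, refresh interval, and localhost endpoint are placeholders rather than our exact config):

```python
# Minimal sketch of an index template matching the layout described above:
# 3-hour time-bucketed indices, 24 primary shards, 1 replica.
import requests

template = {
    "index_patterns": ["events-*"],        # e.g. events-2018.06.01-00, -03, -06, ...
    "settings": {
        "number_of_shards": 24,
        "number_of_replicas": 1,
        "index.refresh_interval": "30s",   # assumption: relaxed refresh for heavy ingest
    },
}

resp = requests.put("http://localhost:9200/_template/events", json=template)
resp.raise_for_status()
print(resp.json())
```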

Questions:

  • Multiple clusters or one giant cluster?
    • I have found that crossing roughly 100 data nodes introduced some very interesting performance problems, and I could achieve better ingest on multiple clusters than on a single large cluster on ES5. Are there any improvements in ES6, or the upcoming ES7, that would make me rethink this?
  • A small number of large indices with many shards, or many smaller indices with fewer shards each (keeping the total shard count the same)?
    • I use search aliases and bucket end-user searches to limit the number of indices hit. Larger indices would make those search aliases less valuable (see the alias sketch after this list).
  • Recommendations on heap memory problems? We are running into memory pressure from a mix of fielddata and segment memory with the current design, which limits how far back we can keep data searchable. It is also why we keep adding instances to the servers: I cannot get more hardware, but I need more heap.
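
Here is a quick illustration of the alias bucketing from the second question (a sketch only; the index pattern, alias name, and endpoint are made up): a search alias groups one day's worth of 3-hour indices so that end-user queries hit only that subset rather than every index.

```python
# Group one day's worth of 3-hour indices under a single search alias so
# user queries hit only that subset of indices.
import requests

actions = {
    "actions": [
        {"add": {"index": "events-2018.06.01-*", "alias": "events-2018.06.01"}}
    ]
}
requests.post("http://localhost:9200/_aliases", json=actions).raise_for_status()

# End users then search the day-level alias rather than events-*:
query = {"query": {"match": {"message": "error"}}, "size": 10}
resp = requests.get("http://localhost:9200/events-2018.06.01/_search", json=query)
print(resp.json()["hits"]["total"])
```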

The fact that you have spinning disks across the board may be limiting your cluster. Have you looked at disk I/O and iowait? In order to minimise heap usage and be able to handle and query larger data volumes, you may want to make sure you follow the guidelines outlined in this webinar.
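
As a concrete starting point on the memory question, something like the following shows per-node heap, fielddata, and segment memory via the cat nodes API (a sketch; host and port are placeholders):

```python
# Quick look at per-node heap, fielddata, and segment memory via the
# _cat/nodes API, to see which of the two is eating the heap.
import requests

cols = "name,heap.percent,fielddata.memory_size,segments.memory,segments.count"
resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"v": "true", "h": cols, "s": "heap.percent:desc"},
)
print(resp.text)
```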

25k shards at an average size of 40 GB across 72 data nodes gives about 13.5 TB per node. Is that what you have?
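
Spelled out:

```python
# 25k shards averaging ~40 GB each, spread over 72 data nodes:
shards, avg_shard_gb, data_nodes = 25_000, 40, 72
gb_per_node = shards * avg_shard_gb / data_nodes
print(f"{gb_per_node:.0f} GB per node (~{gb_per_node / 1024:.1f} TB)")
# -> 13889 GB per node (~13.6 TB)
```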

Thank you @Christian_Dahlqvist for responding.

I have to live with the current hardware limitations. As for disk I/O, though, I am not seeing any specific problems; we are typically around 2% iowait, with rare spikes up to 5%.
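
For reference, one quick way to sample iowait on a data node (a sketch assuming psutil is available; not necessarily what we use for monitoring):

```python
# Sample CPU time percentages over a one-second window and report iowait.
# Requires psutil; the iowait field is only available on Linux.
import psutil

cpu = psutil.cpu_times_percent(interval=1)
print(f"iowait: {cpu.iowait:.1f}%  user: {cpu.user:.1f}%  system: {cpu.system:.1f}%")
```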

Each ES data instance has access to 13.8 TB, so that's really close. :smiley:
