Hiya all...
Some quick scaling questions to get at best practices. We deal with a very large pipeline of events every second and are doing another round of evaluation now that we have moved to ES6. I'll share our current setup; I'd love to get thoughts and recommendations.
Average Event Size: 1.8 kB
Typical Events Per Second: 300k EPS (projected to reach 1 million EPS by the end of next year)
3 racks of 12 servers each, split at the rack level into separate clusters.
Each Server:
- 56 CPUs
- 768 GB memory
- 12 spinning disks split into 6 RAID0 arrays
- 6 ES data instances
- CPU pinning giving each instance 9 CPUs, with 2 CPUs dedicated to system operations (see the sketch after this list)
- Running the Elasticsearch Docker image on CentOS 7
- Supports 30 days' worth of data at 100k EPS per rack
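
Roughly how the pinning is laid out, as a small Python sketch that just prints the `docker run` commands. The image tag, container names, and heap size are illustrative, not our exact values (heap is kept under ~31 GB in the sketch so compressed oops stay enabled):

```python
# Layout: CPUs 0-1 reserved for the OS, then 9 CPUs per data instance
# (6 * 9 + 2 = 56). Tag, names, and 30 GB heap are illustrative.
IMAGE = "docker.elastic.co/elasticsearch/elasticsearch:6.3.2"
SYSTEM_CPUS = 2
CPUS_PER_INSTANCE = 9

for i in range(6):
    start = SYSTEM_CPUS + i * CPUS_PER_INSTANCE
    end = start + CPUS_PER_INSTANCE - 1
    print(f"docker run -d --name es-data-{i} "
          f"--cpuset-cpus {start}-{end} "
          f'-e "ES_JAVA_OPTS=-Xms30g -Xmx30g" '
          f"{IMAGE}")
```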
Each Cluster and Its Index Layout:
- 5 dedicated coordinator nodes
- 5 dedicated master nodes
- 72 data nodes
- Each index is time-bucketed at 3 hours with 24 shards / 1 replica (shards are 30-50 GB); a template sketch follows this list
- Each cluster runs around 25k shards and 150 billion documents
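
For reference, the per-index settings expressed as an index template. A minimal sketch with the official Python client; the template name and index pattern are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Applied to every 3-hour bucket, e.g. events-2018.06.02-03.
# Shard and replica counts match the layout described above.
es.indices.put_template(
    name="events",
    body={
        "index_patterns": ["events-*"],
        "settings": {
            "number_of_shards": 24,
            "number_of_replicas": 1,
        },
    },
)
```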
Questions:
- Multiple clusters or one giant cluster?
- On ES5 I found that crossing roughly 100 data nodes introduced some very interesting performance problems, and I could achieve better ingest with multiple clusters than with a single large cluster. Are there improvements in ES6, or the upcoming ES7, that should make me rethink this? (Sketch 1 below shows the cross-cluster-search side of the trade-off.)
- A small number of large indices with lots of shards, or lots of smaller indices with fewer shards each? (Same total shard count in both layouts.)
- I use search aliases and bucket end-user searches to limit how many indices each query hits (sketch 2 below). Larger indices would mean those aliases provide less value.
- Recommendations on heap memory problems? With the current design we are running into memory pressure from a mix of fielddata and segment overhead, which limits how far back we can search. It is also why we keep adding instances to the servers: I cannot get more hardware, but I need more heap. (Sketch 3 below shows one way to break the usage down.)
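
Sketch 1: if we stay multi-cluster, cross-cluster search (available since 5.3) can stitch the per-rack clusters back together at query time. A minimal sketch with the Python client; the remote name, seed host, and index pattern are made up, and note that early 6.x uses the `search.remote.*` setting keys (renamed `cluster.remote.*` in later 6.x releases):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Register a sibling rack's cluster as a remote; seed host is illustrative.
es.cluster.put_settings(body={
    "persistent": {
        "search.remote.rack2.seeds": ["rack2-master-1:9300"]
    }
})

# One query spanning the local cluster and the rack2 remote.
resp = es.search(index="events-*,rack2:events-*",
                 body={"query": {"match_all": {}}})
print(resp["hits"]["total"])
```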
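
Sketch 2: the alias bucketing, roughly as we do it today. A rolling alias covers a window of the 3-hour buckets so end-user searches only fan out to those indices; index names, the alias name, and the one-day window are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Roll the one-day alias forward: add the new buckets, drop the oldest.
es.indices.update_aliases(body={
    "actions": [
        {"add":    {"index": "events-2018.06.02-*", "alias": "events-1d"}},
        {"remove": {"index": "events-2018.06.01-*", "alias": "events-1d"}},
    ]
})

# User searches go through the alias, touching only the bucketed indices.
resp = es.search(index="events-1d", body={"query": {"match_all": {}}})
```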
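
Sketch 3: one way to see where the heap is going, per node and per field, using the cat APIs. Fielddata usually means `text` fields being sorted or aggregated on; remapping those to `keyword` moves them into doc_values on disk and off the heap:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Per-node heap breakdown: fielddata vs. segment memory.
print(es.cat.nodes(
    h="name,heap.percent,fielddata.memory_size,segments.memory,segments.count",
    v=True))

# Per-field fielddata usage: the candidates for a keyword remapping.
print(es.cat.fielddata(v=True))
```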