Hello everyone, I'm going to set up a cluster for Elasticsearch. My data size is approximately 50 TB and will continue to grow. I will primarily use it for search and cross-queries. My question is: I will be using virtual machines. Would 60 servers with 64 GB RAM each be better, or 30 servers with 128 GB RAM each? How should I decide? Thank you.
It Depends™, sorry, either setup could turn out to be better depending on your specific workload and goals. At this kind of scale even a tiny difference in usage can tip the balance one way or the other.
I will feed the Elasticsearch part from my core system, which is MSSQL. A single index with parent-child relationships can hold 7-8 billion documents. What would your recommendation be in this case?
@DavidTurner already answered as well as anyone can based on such limited info.
It will depend on the specific docs and their structure, how they will be mapped to indices, the volume and complexity of the most critical queries, whatever you mean by cross-queries, any aggregations, and how quick your solution needs to be at indexing, resolving queries, aggregating data, … And even IF you supplied quite a lot of detail on that, the answer would still be "It Depends™", as the specifics of the VMs (how much CPU, how fast the storage, etc.) will then come into focus.
You only offer 2 options: 60 servers (VMs) with 64 GB RAM or 30 servers (VMs) with 128 GB RAM. Between those 2 options, assuming the VMs are the same except for RAM, I'd take the 60 x 64 GB VMs, as I would (likely) have more aggregate IO bandwidth, more total CPU, and the same total memory.
But @celikbaris61, before all of that I'd want to make sure I have the right experience and skillsets in my team to manage the deployment, testing, integration, and tuning. And BAU Operations after that. Asking such a "bare" question is (sorry) not a great sign that you are in that place.
Using parent-child relationships adds a lot of complexity and overhead, and how it scales and performs will depend a lot on the nature of your data and queries. The cardinality of parent and child entities will also affect how well data can be distributed across shards and nodes and there is always a risk of ending up with uneven shard sizes and hotspots.
This is an approach often considered when moving from or complementing a relational database, but trying to replicate a relational structure in Elasticsearch is in my experience rarely the best option. It is instead generally recommended to denormalise and flatten the data wherever possible, and at this scale this is something I would strongly recommend looking into.
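To make the contrast concrete, here is a rough sketch with the Python client (the index and field names are made up for illustration, not taken from your system) of what the two modelling approaches tend to look like:

```python
# Sketch only: hypothetical index/field names, assuming Elasticsearch 8.x/9.x
# and the official elasticsearch-py client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust connection/auth for your cluster

# Option A: parent-child via a join field. Children must live on the same
# shard as their parent, so routing is mandatory, and queries need
# has_child / has_parent / parent_id, which carry extra heap and query-time cost.
es.indices.create(
    index="orders-joined",
    mappings={
        "properties": {
            "relation": {"type": "join", "relations": {"order": "order_line"}},
            "customer_id": {"type": "keyword"},
            "sku": {"type": "keyword"},
            "amount": {"type": "double"},
        }
    },
)
es.index(index="orders-joined", id="o1",
         document={"relation": "order", "customer_id": "c42"})
es.index(index="orders-joined", id="l1", routing="o1",
         document={"relation": {"name": "order_line", "parent": "o1"},
                   "sku": "ABC", "amount": 9.99})
# "orders that contain a line with this SKU" needs a join query:
es.search(index="orders-joined",
          query={"has_child": {"type": "order_line",
                               "query": {"term": {"sku": "ABC"}}}})

# Option B: denormalised/flattened. Each line carries the parent fields it
# needs, so the same question becomes an ordinary term/bool query.
es.indices.create(
    index="order-lines-flat",
    mappings={
        "properties": {
            "order_id": {"type": "keyword"},
            "customer_id": {"type": "keyword"},
            "sku": {"type": "keyword"},
            "amount": {"type": "double"},
        }
    },
)
es.index(index="order-lines-flat",
         document={"order_id": "o1", "customer_id": "c42",
                   "sku": "ABC", "amount": 9.99})
es.search(index="order-lines-flat", query={"term": {"sku": "ABC"}})
```

With the flattened model the trade-off is more index-time work and some duplication, but queries stay simple and shards stay independent, which matters a lot at tens of terabytes.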
If you can provide some details about the size and complexity of the data model, users here may be able to provide some basic guidance, but any detailed discussion about data modelling likely requires extensive information about data, relationships, queries and update patterns, which is probably beyond what can be handled in the setting of this forum.
@RainTown and @Christian_Dahlqvist and @DavidTurner Your answers have shed a lot of light on this for me. Thank you to everyone who helped.
By the way, I am also against using it as a relational database, but software developers want a lot of parent-child related data, and this is a huge burden for me.
That is often problematic, so I would recommend running a POC at small scale (though with multiple nodes and shards) before going down that route, rather than deploying a substantial cluster for a solution that may not work at all.
Do the developers have experience with Elasticsearch and the use of parent-child relationships or are they maybe relatively new to the technology?
Unless something has changed in more recent versions I believe parent-child relations require a fair bit of additional heap space, so this may push you towards smaller nodes with more total heap.
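If I recall correctly, much of that heap goes into the global ordinals that the join field builds eagerly on refresh by default. Something like this (again a hypothetical mapping, same client as in the earlier sketch) defers that work to the first join query instead:

```python
# Sketch only: hypothetical index and field names. For join fields, global
# ordinals are (as far as I know) built eagerly on refresh by default and
# held on heap; this setting defers building them to the first query that needs them.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

es.indices.create(
    index="orders-joined-lazy",
    mappings={
        "properties": {
            "relation": {
                "type": "join",
                "relations": {"order": "order_line"},
                "eager_global_ordinals": False,  # join fields default to True
            }
        }
    },
)
```

That only shifts the cost around, though; it does not remove it.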
These are questions best considered by (experienced) architects.
I once inherited a solution where developers had decided a shared Ceph storage cluster, with a small SSD cache tier in front of large HDDs, was a great idea as the singular (and therefore common) storage for several MongoDB and Couchbase clusters, Oracle, the entire application footprint (OpenShift at the time), and the logging infra for same, which included Kafka, Logstash and Elasticsearch. It worked great in a dev environment. It worked great until real customer load was added. The devs all ran a mile and said storage infra had nothing to do with them when Operations started screaming. I therefore agree with @Christian_Dahlqvist that you start with a POC, but would urge you not to go "too small" on that; it has to be as representative as possible.
More of the nodes with smaller RAM, the 64 GB VMs from @celikbaris61's 2 options, because it is recommended that Elasticsearch's JVM heap should not exceed ~32 GB, per node obviously, irrespective of how much memory your node has.
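To put rough numbers on that (back-of-envelope only, assuming the usual guidance of heap around half the RAM, capped at ~31 GB to stay under the compressed-oops threshold):

```python
# Back-of-envelope only: heap ~= 50% of node RAM, capped at ~31 GB per node.
HEAP_CAP_GB = 31

def cluster_memory(nodes: int, ram_gb: int) -> dict:
    heap = min(ram_gb // 2, HEAP_CAP_GB)               # per-node JVM heap
    return {
        "total_ram_gb": nodes * ram_gb,
        "total_heap_gb": nodes * heap,                  # what parent-child / fielddata draws on
        "total_fs_cache_gb": nodes * (ram_gb - heap),   # left for the OS page cache
    }

print(cluster_memory(60, 64))   # {'total_ram_gb': 3840, 'total_heap_gb': 1860, 'total_fs_cache_gb': 1980}
print(cluster_memory(30, 128))  # {'total_ram_gb': 3840, 'total_heap_gb': 930, 'total_fs_cache_gb': 2910}
```

Same 3840 GB of total RAM either way, but the 60-node option gives you roughly twice the aggregate heap, which is where the join-field overhead lands; the 30-node option leaves more per-node filesystem cache instead.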
@Christian_Dahlqvist They used it before, but at that time it was 6.6; now I'm considering using 9.1.x.
@RainTown I understand what you're saying. I'll try to achieve the best architecture. Thank you.
OK, that is good to know. If they have used it before they should be aware of most of the potential issues and compromises, because as far as I know the majority of them have not fundamentally changed in quite some time. I hope their experience was gained with a use case of significant size, so they got to address scaling and performance issues.
You're right, the fundamentals haven't changed.
They just need to adapt to the features that came with version 9.x.
What was the cluster size/scale at the time of the 6.6 deployment, and what are the main differences between then and now? It might also be interesting to know why it is not in use anymore, if that is the case (it is implied, but not stated). If so, then you effectively had a POC++, from which you can maybe learn quite a lot.
When you're dealing with 50 TB and billions of records, the choice between many smaller nodes vs fewer bigger nodes really comes down to I/O, network and query patterns. Since you're using parent-child relations and heavy cross-queries, go with whichever setup gives you more bandwidth and less contention. Before locking in, I'd recommend a proof-of-concept cluster that mimics your actual query load and data shape; you'll learn a lot from that.
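For the POC, even before reaching for a proper benchmarking tool like Rally, a quick latency probe against queries captured from your real workload already tells you a lot. A minimal sketch (endpoint, index name and query are placeholders):

```python
# Minimal latency probe for a POC cluster. Endpoint, index and query are
# placeholders; swap in queries captured from your real workload.
import statistics
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("http://poc-cluster:9200")  # hypothetical POC endpoint

QUERY = {"has_child": {"type": "order_line", "query": {"term": {"sku": "ABC"}}}}

def measure(index: str, query: dict, runs: int = 50) -> None:
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        es.search(index=index, query=query, size=10)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    print(f"p50={statistics.median(latencies):.1f} ms "
          f"p95={latencies[int(0.95 * len(latencies)) - 1]:.1f} ms")

measure("orders-joined", QUERY)
```

Run it against the representative data shape, not a toy subset, or the numbers will tell you very little, as noted above.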