Cluster design with fully replicated data - please advise


(Jan Janiczek) #1

I have what seems to be a rather atypical use case for ES: a relatively small (<100 GB) dataset, small enough that each server can hold a full copy of the data. The cluster will have to handle a moderate amount of new data, but this will be balanced by deletions of old data, so assuming I schedule regular force merges, the size of the dataset should stay fairly stable. I also need to support a high number of searches, ideally over a thousand per second, so I want to optimize for that.
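To make "a full copy on each server" concrete, here is a minimal sketch of the index settings body I have in mind. It relies on the `index.auto_expand_replicas` setting with the value `"0-all"`, or alternatively on an explicit `number_of_replicas` equal to the number of data nodes minus one. The function name, the default of one primary shard, and the node count are my own assumptions for illustration, not something I have tested at scale.

```python
import json


def full_copy_index_settings(data_nodes: int, primary_shards: int = 1,
                             auto_expand: bool = True) -> dict:
    """Build an index-creation body so every data node holds every shard.

    data_nodes and primary_shards are illustrative parameters (assumptions).
    """
    if data_nodes < 1 or primary_shards < 1:
        raise ValueError("data_nodes and primary_shards must be >= 1")

    index_settings = {"number_of_shards": primary_shards}
    if auto_expand:
        # Let the cluster adjust the replica count as data nodes join or leave.
        index_settings["auto_expand_replicas"] = "0-all"
    else:
        # One primary plus (data_nodes - 1) replicas gives one copy per node.
        index_settings["number_of_replicas"] = data_nodes - 1
    return {"settings": {"index": index_settings}}


if __name__ == "__main__":
    # This body would be sent when creating the index (PUT /<index-name>).
    print(json.dumps(full_copy_index_settings(data_nodes=3), indent=2))
```

With `auto_expand=False` and three data nodes this yields two replicas, so adding a fourth node later would require raising the replica count by hand.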

I've searched all over the internet for advice on ES cluster design, and the commonly recommended "best practice" seems to be to have:
3 small servers as master nodes
2+ big servers as data nodes, with more added as data volume grows
1-2 http nodes

However, all of this advice seems designed for a typical ES use case that is quite different from my own - I don't have a massive dataset that I would need to spread across multiple servers, with all the coordination overhead that entails, and I don't expect this dataset to grow much, so if I add more nodes it would be to increase search throughput, not to store more data.

So, my question is: does it make sense to keep all three node functions (http, data, master) separate in my case? I am especially unsure about the separate http nodes. Each data node has the complete dataset and can fully answer any query on its own, so there is no need to split queries or reduce answers. Is it still worth having them?

The design I am considering now is:
3 small servers as master nodes
3+ big servers as data/http nodes, each with the full dataset
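As a rough sketch of that layout, the snippet below generates the per-node role line for each server. It assumes the `node.roles` syntax of Elasticsearch 7.9 and later (older versions use `node.master` / `node.data` booleans instead), and the node names are hypothetical. The data nodes get no separate http role because any node that receives a search request coordinates it itself.

```python
def node_roles_layout(master_nodes: int = 3, data_nodes: int = 3) -> dict:
    """Map hypothetical node names to their roles for the proposed design."""
    layout = {}
    for i in range(1, master_nodes + 1):
        # Small dedicated master-eligible nodes: hold no data.
        layout[f"master-{i}"] = ["master"]
    for i in range(1, data_nodes + 1):
        # Big nodes: hold the full dataset and take client searches directly.
        layout[f"data-{i}"] = ["data"]
    return layout


def roles_yaml_line(roles: list) -> str:
    """Render the elasticsearch.yml line for one node (7.9+ syntax assumed)."""
    return "node.roles: [ " + ", ".join(roles) + " ]"


if __name__ == "__main__":
    for name, roles in node_roles_layout().items():
        print(f"{name}: {roles_yaml_line(roles)}")
```

Clients would then spread their search requests across the data nodes only, leaving the master nodes out of the search path.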

Would that work fine? Also, am I correct to assume that with all data nodes having identical data, the cluster state will be tiny and the master nodes can thus be really small?

