I'm working with a very small index: only 65,000 documents, about 50MB in size. The index is also very "stable", since it is written only once a day and receives no writes in between. Because of that, I do not have to worry about indexing performance.
My goal is to maximize performance for search: number of concurrent searches and the search latency. High availability is a nice bonus, but not the most important part.
I have read a large number of guides, documentation pages and tutorials on this subject. I have also benchmarked several different setups. However, since my use case with a very small index seems to be so uncommon, I do not know what a suitable "basic setup" to start with would be. For example, it is usually recommended to have 1 shard per 40GB of data. But on the other hand there should be at least 1 shard per node...
I now have a 3-node cluster with 1 shard and 3 read replicas. I have also experimented and benchmarked with other options. I almost always end up with a somewhat unbalanced setup, where only 2 nodes actually take the load and 1 stays idle.
What basic setup would you recommend as a starting point for my use case? Number of nodes? Number of shards? Number of replicas? Anything else I should know?
You probably want a 3-node cluster where all nodes have the same profile (master/data). As you have a single small index, I would stick with 1 primary shard and 2 replica shards. Make sure your clients are set up to load balance requests across all three nodes in the cluster. You may also consider using the '_local' preference (I do not think this is the default).
This way all nodes hold a copy of the data and can serve it locally. With a single primary shard, each search only has to query one shard, which optimizes the number of concurrent searches the cluster can handle, and the shard size should not cause any performance issues.
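To make the suggestion above concrete, here is a minimal sketch in Python of the two pieces a REST client would need: the settings body for an index with 1 primary shard and 2 replicas, and a simple client-side round robin over the three nodes that adds `preference=_local` to each search. The node addresses, the index name and the helper function names are assumptions made up for illustration. The sketch only builds the request body and URLs, it does not send anything.

```python
import itertools
import json
from urllib.parse import urlencode

# Hypothetical node addresses and index name; replace with your own.
NODES = [
    "http://es-node-1:9200",
    "http://es-node-2:9200",
    "http://es-node-3:9200",
]
INDEX = "my-index"


def index_settings_body(replicas=2):
    """Build the JSON body for PUT /<index>: 1 primary shard plus `replicas` copies."""
    settings = {
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_replicas": replicas,
            }
        }
    }
    return json.dumps(settings)


def search_urls(nodes=NODES, index=INDEX):
    """Yield search URLs forever, rotating over the nodes (client-side round robin)."""
    # preference=_local asks the receiving node to use its own shard copy if it has one.
    query = urlencode({"preference": "_local"})
    for node in itertools.cycle(nodes):
        yield f"{node}/{index}/_search?{query}"
```

A client would send the first body once when creating the index, then take the next URL from `search_urls()` for every search request, so that each of the three nodes receives roughly a third of the traffic and answers from its own copy of the shard.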
I have not defined node.roles at all at the moment, so I think that all nodes now have multiple roles (master, data, data_content, data_hot, ingest, ml etc.) by default. Should I explicitly set node.roles to [ master, data ] on all nodes instead of keeping this default setup? Do I need to mark all nodes as master, or just one node?
I'm using the REST API instead of a "direct" client library. I think that load balancing is set up there out of the box, am I right?
I'm excited to benchmark these new settings! It is great to get a decent starting point, so thank you very much for your input!