I'm working with a very small index. It is only 65000 documents and is about 50MB in size. Index is also very "stable" since it is only written once a day and does not receive any writes meanwhile. Because of that, I do not have to worry about the performance for indexing.
My goal is to maximize performance for search: number of concurrent searches and the search latency. High availability is a nice bonus, but not the most important part.
I have read extensive number of guides, documentations and tutorials about this subject. I have also benchmarked several different setups. However, since my use case with very small index seems to be so uncommon, I do not know the suitable "basic setup" to start with. For example it is usually recommended to have 1 shard for 40G of data. But on the other hand there should be at least 1 shard per node...
I have now 3 node cluster with 1 shard and 3 read replicas. I have also experimented and benchmarked with other options. I will almost always end up with somehow unbalanced setup with only 2 nodes actually taking the load and 1 staying idle.
What would be your recommendations for basic setup for my use case from where I could begin? Number of nodes? Number of shards? Number of replicas? Anything else I should know? )
You probably want a 3 node cluster where all nodes have the same profile (master/data). As you have a single small index I would stick with 1 primary shard and 2 replica shards. Make sure your clients are set up to load balance requests across all three nodes in the cluster. You may also consider using '_local' preference (do not think this is default).
This way all nodes hold a copy of the data and can serve it locally. As there is a single primary shard you optimize the number of concurrent searches the cluster can handle and the shard size should not cause any performance issues.
I have not defined node.roles at all at the moment, so I think that all nodes has now multiple roles (master, data, data_content, data_hot , ingest, ml etc....) as a default. Should I specifically define node.roles to [ master, data ] for all nodes instead of this default setup? Do I need to mark all nodes as master or just one node?
I'm using REST API instead of "direct" client library. I think, that load balancing is set upped there out-of-the-box, am I right?
I'm excited to benchmark these new settings! It is great to get a decent starting point, so thank you very much for your input!
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.