Setting up Elasticsearch in a production environment

I want to know what configuration setup would be ideal for my case. I have 4 servers (nodes), each with 128 GB RAM. All 4 nodes will be in one cluster.

The total number of indexes would be 10, each receiving 1,500,000 documents per day.

Since I'll have 4 servers (nodes), I'll set master:true and data:true on all of them, so that if one node goes down, another becomes master. Every index will have 5 shards.
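For reference, that plan would look roughly like this in each node's elasticsearch.yml — a sketch only, assuming the 2.x-era setting names, with the cluster name as a placeholder:

```
cluster.name: my-cluster     # placeholder cluster name
node.master: true            # eligible to be elected master
node.data: true              # also holds and serves data
index.number_of_shards: 5    # default primary shards per index
```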

I want to know which configuration parameters I should alter in order to get the most out of Elasticsearch.

Also tell me how much memory is enough for my usage, since I'll have very frequent select queries in production (maybe 1000 requests per second).

Need a detailed suggestion.

This is hard to answer without saying 'it depends'. The configuration parameters that are best for your use case depend on a huge number of variables, such as the size of your documents, how many fields a document contains, what types of fields you have, how the fields are analyzed, and what queries you run, to name just a few. The best way to tune your Elasticsearch instance is to run representative queries on your data (or a sample of it), measure performance (memory usage, query times, etc.), identify bottlenecks, and search the documentation and the archives of this forum for ways to fix those bottlenecks.

Of course, if you can't find the way to resolve a particular bottleneck you can also open a topic in this forum. If you do, please try to give as much information as possible about the problem you are seeing and what you have tried to resolve it. This will help a lot in finding a solution for you quickly.

There are a few pointers to get you started, though. The book 'Elasticsearch - The Definitive Guide' is free online and has a chapter dedicated to things to consider for production deployments. You can find it here:

I would recommend reading as much of that book as you can as it will give you an understanding of what Elasticsearch is doing under the covers and how you might be able to solve the problems you encounter.

Also, we generally recommend assigning 50% of your server's memory to the JVM heap and leaving the rest to the file system cache (Elasticsearch makes heavy use of the file system cache). You did say, however, that your servers have 128GB of memory, so 50% would be 64GB, which is above the maximum recommended heap size of 31GB (above that, the JVM cannot use compressed pointers and performance actually decreases). Here you can either assign each node 31GB of JVM heap and leave the remainder to the file system cache, or run two nodes on each server, each with a 31GB heap. More information on this can be found here (especially in the 'Don't Cross 32 GB!' section):
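In practice that means pinning the heap at 31GB, with min and max set equal so the heap never resizes at runtime. How you set it depends on your version (ES_HEAP_SIZE in older releases, jvm.options in newer ones) — this is just a sketch:

```
# jvm.options (or: export ES_HEAP_SIZE=31g on older versions)
-Xms31g
-Xmx31g
```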


@Harlin_ES You could be a little more helpful here.

@colings86 thanks for your detailed answer. Here are a few more questions continuing from the above.

Here is the scenario explained:
I have four nodes (4 servers) divided between two regions, e.g. 2 servers in Germany and the remaining 2 in France.

Germany server naming (g1 and g2) -> g1 being the primary server and g2 its failover.
France server naming (f1 and f2) -> f1 being the primary server and f2 its failover.

Now we have ES setup on g1, g2, f1 and f2 under one cluster.

My web application resides on g1 and f1 (with failover on g2 and f2) and talks to Elasticsearch, which is available on all servers. What I want is to query the nearest ES node: the web app on g1 should ask the g1 ES node for data, and likewise the app on f1 should ask the f1 node. However, if the app on g1 asks the g1 node and it is unavailable at that time, I want the g2 node to respond instead, since g2 is the next nearest, so the data is still received quickly.
My questions are:

  1. How can I direct queries to the specific nodes nearest to my web app, so they respond quickly, rather than querying a node in the other region?
  2. If the nearest node fails, how can I query the next nearest node (which is up)?
  3. I want my data to be present on all the nodes, so is it good practice to set master:true and data:true on all nodes?

Don't run cross-DC clusters; we don't support them due to latency sensitivity and the potential for split brain.
It also makes questions like the ones you are asking harder to answer.

Have two clusters, replicate the data between them, and then query each cluster locally.
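To illustrate the "query the local cluster, fall back if it's unreachable" idea, here's a minimal Python sketch. The client objects, hostnames, and ordering are stand-ins for illustration; a real setup would use an Elasticsearch client, which has its own connection-retry options:

```python
# Minimal sketch: try the nearest endpoint first, fall back to the next.
# The FakeClient objects stand in for real Elasticsearch clients; the
# g1/g2 names are assumptions taken from the scenario above.

def query_with_failover(endpoints, run_query):
    """Try each endpoint in order of proximity; return the first success.

    endpoints: list of (name, client) tuples, nearest first.
    run_query: function taking a client and returning a result,
               raising an exception if that endpoint is unreachable.
    """
    last_error = None
    for name, client in endpoints:
        try:
            return name, run_query(client)
        except Exception as err:  # endpoint down or unhealthy
            last_error = err
    raise RuntimeError("all endpoints failed") from last_error


class FakeClient:
    """Stand-in for a real search client."""
    def __init__(self, up, data):
        self.up, self.data = up, data

    def search(self):
        if not self.up:
            raise ConnectionError("node down")
        return self.data


# Example: g1 is down, so the query falls back to g2.
german_endpoints = [("g1", FakeClient(False, None)),
                    ("g2", FakeClient(True, {"hits": 3}))]
name, result = query_with_failover(german_endpoints, lambda c: c.search())
# → ("g2", {"hits": 3})
```

The same ordering trick applies per region: the app on f1 would list (f1, f2) nearest-first instead.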

@warkolm In case there is a physical disaster in one region, say Germany, and I heavily rely on my ES data, how can I get my data back if I haven't set up another node in some other region, say France? Please guide.

You need to send data to both clusters separately, or replicate it using a queuing mechanism.

@warkolm can you please give us an example of cluster replication?

  1. Get your app to write to each cluster directly.
  2. Get your app to write to a messaging queue and then use that to replicate to each cluster.
  3. Use the snapshot and restore API to get data from one cluster to another.
  4. Use Logstash to copy the data across.
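Option 1 above can be sketched roughly like this in Python. The clients and the index name are made up for illustration; in a real application a failed write to one cluster should go to a retry queue rather than be silently dropped:

```python
# Sketch of option 1: the application itself writes every document to
# both clusters. FakeCluster stands in for a real Elasticsearch client;
# "logs-2016" is a hypothetical index name.

def dual_write(clusters, index, doc):
    """Write `doc` to `index` on every cluster; report per-cluster success."""
    results = {}
    for name, client in clusters.items():
        try:
            client.index(index=index, document=doc)
            results[name] = True
        except Exception:
            results[name] = False  # candidate for a retry queue
    return results


class FakeCluster:
    """Stand-in client that records indexed documents."""
    def __init__(self):
        self.docs = []

    def index(self, index, document):
        self.docs.append((index, document))


clusters = {"germany": FakeCluster(), "france": FakeCluster()}
status = dual_write(clusters, "logs-2016", {"msg": "hello"})
# status == {"germany": True, "france": True}
```

Option 2 is the same idea with a message queue in the middle, which also buffers writes while one cluster is unreachable.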