I have been using Elasticsearch for development on my local machine and managed to set up a 3-node cluster by starting each node with the following command:
And having discovery.zen.minimum_master_nodes: 2 in elasticsearch.yml
For production I also have only 1 machine, and I'm looking to set up the same 3-node cluster.
The issue is that pretty much all examples describe each node having its own elasticsearch.yml where you can set the configuration for each separate node.
Am I missing something, or is it just the difference between using different machines for each node instead of 1? And if so, if I'm using the setup described above, is it somehow possible to configure individual nodes differently under the same elasticsearch.yml?
Each node does have its own data directory in the case I described. They just happen to have the same config file.
What do you expect to gain from having multiple nodes on a single host?
Redundancy, in case one fails. Isn't that the whole point? Also parallelism.
If you have a machine with 4-8 or 32 cores, are you using a single node? Why install Elasticsearch x number of times on the same server when it allows you to create multiple nodes with only one installation? I simply don't understand the point of it in the first place. Can you please explain?
Unless you have a large bare-metal server with 128 GB RAM or more, having multiple nodes on the same host does not make much sense.
You will have multiple small nodes competing for the same resources. Elasticsearch is more bound to disk and memory than to CPU cores, and in the end, if your host fails, your entire cluster fails. In that case it is better to have a single larger node.
But to run multiple nodes on the same host you need a different elasticsearch.yml for every node, with separate data and log folders; there isn't a way to use the same elasticsearch.yml to run multiple nodes at the same time.
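For example, a minimal per-node layout might look something like this (cluster name, node names, ports, and paths below are just placeholders):

```
# node-1/config/elasticsearch.yml
cluster.name: my-cluster
node.name: node-1
http.port: 9200
path.data: /var/lib/elasticsearch/node-1
path.logs: /var/log/elasticsearch/node-1

# node-2/config/elasticsearch.yml
cluster.name: my-cluster
node.name: node-2
http.port: 9201
path.data: /var/lib/elasticsearch/node-2
path.logs: /var/log/elasticsearch/node-2
```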
Even if you have large servers with 128 GB or more, I would recommend using Docker to run the instances instead of configuring multiple nodes on the same host.
Yet on a single-node cluster replica shards can't be allocated. Isn't this important as well?
If you are running a single instance of Elasticsearch, you have a cluster of one node. All primary shards reside on the single node. No replica shards can be allocated, therefore the cluster state remains yellow. The cluster is fully functional but is at risk of data loss in the event of a failure.
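For reference, you can see that state through the cluster health API (default host and port assumed, output abbreviated):

```
curl -s 'http://localhost:9200/_cluster/health?pretty'
# ...
# "status" : "yellow"   <- primaries allocated, replicas unassigned on a single node
# ...
```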
If I start Elasticsearch instances with the following command, I'll end up having 1 config folder, so basically 1 elasticsearch.yml file, and multiple data folders, each holding a separate node. There's also a node.max_local_storage_nodes setting.
Shared storage. I had somehow gotten the idea that you need multiple nodes no matter what, and that they automatically increase performance and reliability regardless of the actual number of servers. That idea was reinforced by the fact that you can create multiple nodes with a single elasticsearch.yml.
I will probably follow your advice and create only 1 node, but the whole thing is a bit weird, because all the documentation seems to point towards multi-node clusters.
Besides what Christian already said, running multiple nodes on the same host with shared storage and using replicas can make your performance worse, because you will be using the same I/O resources to write the same thing to the same place multiple times.
For example, if you have an index with 3 shards and 1 replica, you will have 6 shards in total: 3 primaries and 3 replicas. But in the end your storage is shared, as are your resources, so you will be writing the same thing to the same place twice.
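As an illustration, that shard layout comes from index settings like these (index name is just an example):

```
curl -X PUT 'http://localhost:9200/products' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'
```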
Also, when you use elasticsearch -E config=value, you are in reality not using elasticsearch.yml for that config. I don't know if you can pass every node config on the command line, I've never tested it, but I would recommend sticking with one elasticsearch.yml per node if you want to go down this path.
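For reference, the per-node override pattern on the command line looks roughly like this (node names and paths are only examples):

```
./bin/elasticsearch -E node.name=node-1 -E path.data=data/node-1 -E path.logs=logs/node-1
./bin/elasticsearch -E node.name=node-2 -E path.data=data/node-2 -E path.logs=logs/node-2
./bin/elasticsearch -E node.name=node-3 -E path.data=data/node-3 -E path.logs=logs/node-3
```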
Also, the node.max_local_storage_nodes setting is deprecated and will be removed in version 8.x.
Got it. In this case, when needing to increase performance, does it make more sense to split the same resources over multiple machines? Do you think there is a significant difference in performance between having (for example):
1 host with 8 cores / 32 GB RAM or
2 hosts with 4 cores / 16 GB RAM each ?
It really depends on your use case, but there are some things that you could follow.
It is recommended to leave 50% of the memory for the system processes and use the other 50% for the JVM heap size of the Elasticsearch process.
With 32 GB of RAM you can have a heap size of 16 GB, and there is another recommendation to have a maximum of 20 shards per GB of heap per node, so with 16 GB of heap you would be able to have something like 320 shards on this node.
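The heap size itself is set in jvm.options (or through the ES_JAVA_OPTS environment variable); for a 16 GB heap that would be, for example:

```
# config/jvm.options
-Xms16g
-Xmx16g
```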
So, since Elasticsearch will by default balance the number of shards between the nodes, in theory a single node with 32 GB of RAM and 16 GB of heap is the better choice.
You can read more about it in this blog post that has some recommendations regarding heap and sharding.
It is always recommended to have at least 3 master-eligible nodes. I would stick to 3 nodes and make them reasonably large rather than setting up lots of very small nodes. The amount of CPU required will depend on your use case, so you need to test. Elasticsearch is IMHO more often limited by storage speed than by CPU, so you need to find a balance of resources.
Just so I can get an idea about future optimizations. At this point I have 2 indices:
index A: 1 GB, 750,000 documents
index B: 30 MB, 350,000 documents
The documents are products, and the most expensive operation is running all (7) aggregations on all products, for which I get a response in 0.5 seconds. This is while it's only me using the server (my local computer). As soon as I start filtering by different aggregations, the response time decreases. And for full-text search I get the response in under 0.1 seconds.
Are these good starting results? And should I go with 1 shard per index and only increase the number of shards in relation to the volume of data (each extra 20 GB, as per the documentation)? This is my first Elasticsearch project that I'm currently looking to deploy to production. Because I plan to increase the indices 10-20 times, I'm a bit worried whether I'm on the right path or not and thought I'd ask this as well.
This isn't quite right. It is recommended to leave at least 50% of the memory for the system processes and filesystem cache (and Elasticsearch's non-heap memory needs) and use at most 50% for the JVM heap size of the Elasticsearch process. There are definitely cases where a smaller heap performs better thanks to the larger filesystem cache.
Good points from everyone else so far. If you're looking to just learn how a cluster is generally configured and can't spin up multiple hosts (physical or VM), then you can do it in Docker.
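As a rough sketch (image tag, service names, heap sizes, and ports below are placeholders; volumes, ulimits, and security settings omitted), a three-node cluster in Docker Compose could look like this:

```
version: "2.2"
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - node.name=es01
      - cluster.name=docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - 9200:9200
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - node.name=es02
      - cluster.name=docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - node.name=es03
      - cluster.name=docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
```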