We are running an Elasticsearch cluster in production. The setup is:
10 data nodes - 8 CPUs and 64 GB RAM, half of it heap
3 master nodes - 4 CPUs and 16 GB RAM, half of it heap
We have a large number of indices in the cluster, around 18,000, each with 1 replica (36,000 shards in total).
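For reference, this is how we check shard distribution, using the `_cat` APIs (the `localhost:9200` endpoint is an assumption; adjust for your cluster):

```shell
# Shard count and disk usage per data node
curl -s 'localhost:9200/_cat/allocation?v'

# Cluster-wide shard totals and health
curl -s 'localhost:9200/_cat/health?v'
```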
Our problem is that every few search requests, one comes back slow. Investigating with the Profile API led us to conclude it's not the search itself, since the profiling metrics look good, but something else in the infrastructure.
Our suspicion is that the problem is the large number of indices in the cluster, or something similar related to the master nodes. After increasing the CPU and memory of the master nodes, we saw much better search performance in our app.
Looking in the elastic docs we found this:
Elasticsearch is a peer to peer based system, in which nodes communicate with one another directly. The high-throughput APIs (index, delete, search) do not normally interact with the master node. The responsibility of the master node is to maintain the global cluster state and reassign shards when nodes join or leave the cluster. Each time the cluster state is changed, the new state is published to all nodes in the cluster as described above.
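One way to see what the master actually has to manage is to pull the cluster state and look at its size; with tens of thousands of shards it gets very large, and every change must be republished to all nodes (the `localhost:9200` endpoint is an assumption):

```shell
# Size in bytes of the full serialized cluster state; grows with index/shard count
curl -s 'localhost:9200/_cluster/state' | wc -c

# Just the routing table (shard-to-node assignments), usually the largest part
curl -s 'localhost:9200/_cluster/state/routing_table' | wc -c
```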
It seems that the master nodes are not involved in the high-throughput APIs, so how can our problem, and the improvement in search performance after upgrading the master nodes, be explained?
We would be happy to understand this, as it is not clear from the docs.
That's 3,600 shards per data node, with an average shard size of less than a gigabyte.
You are wasting a serious amount of heap managing those shards, which is likely the bulk of the problem you are seeing.
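To put numbers on it: the Elasticsearch docs' rule of thumb is to aim for 20 shards or fewer per GB of heap. A quick sanity check against the figures in the question (cluster sizes are from the post; the 20-per-GB figure is the documented guideline):

```python
# Cluster figures from the question
data_nodes = 10
heap_gb_per_node = 32           # 64 GB RAM per data node, half of it heap
total_shards = 18_000 * 2       # 18,000 indices x (1 primary + 1 replica)

shards_per_node = total_shards / data_nodes
# Documented rule of thumb: <= 20 shards per GB of heap
recommended_max = 20 * heap_gb_per_node

print(shards_per_node)   # 3600.0
print(recommended_max)   # 640
```

So each data node carries well over five times the recommended shard count for its heap.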
The only explanation I can think of is that with more resources your master nodes can better handle the cluster state and allocation tables for the sheer number of shards in your cluster. Frankly, though, focusing on that is missing the forest for the trees; if you don't reduce the shard count you are just asking for continued problems.
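Since each index here has a single primary shard, the shrink API won't help; the fix is to consolidate many small indices into fewer larger ones with the reindex API. A sketch, with hypothetical index names and endpoint:

```shell
# Reindex a batch of small indices (matched by wildcard) into one larger index.
# Index names and naming scheme are hypothetical; adjust to your own.
curl -s -X POST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": { "index": "logs-2023.01.*" },
  "dest":   { "index": "logs-2023-01" }
}'

# Once the new index is verified, delete the originals to actually free the shards.
# (Wildcard deletes may require action.destructive_requires_name to be false.)
curl -s -X DELETE 'localhost:9200/logs-2023.01.*'
```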