We're running an ES cluster that keeps daily indexes per customer for a one-year period. With around 100 customers and replica shards enabled, this puts us in the ballpark of 50k shards.
The behavior we're seeing is that while query and indexing performance is ok, operations which require the master nodes like create/delete index are getting gradually worse. Currently deleting an index can take up to a minute!
I assume the issue has to do with the cluster state objects kept on the masters. However, we don't see any indications of stress on the masters: their JVM heap usage peaks at 75% cyclically and CPU stays around 50%.
How is one supposed to scale the masters in this situation? If we double the allocated Xmx, could that help? I couldn't find any other config params to change aside from this.
Our shards are not huge. The reason we don't unify them into, say, monthly indexes instead of daily ones is that ES doesn't perform well on document deletes (which trigger merges). Our daily index setup allows us to always create fresh new indexes and simply delete the old ones without updating/deleting records. Until we ran into this scale issue with the masters, this strategy reduced load considerably on our cluster.
Any advice or ideas are welcome! Here are our specs:
20 data nodes, 3 client nodes, 3 master nodes. 50k shards, 42 TB of data.
Master nodes currently have 16 GB of memory.
BTW - Just wondering, why don't the master nodes share the cluster state instead of assigning a leader? If I have several master nodes anyway to avoid a single point of failure, then surely it's much more scalable if the cluster state can be distributed among them?
Another point: why does the cluster state have to keep mappings/settings for each index in memory? In our use case all the indexes need identical mappings, and we'd prefer one mapping for all of them, not 50k. Is there any way to do this?
You mean to have several clusters and share the shards between them?
Doesn't sound fun. It would mean extra client logic to decide which cluster to go to, and also duplicate master/client nodes to pay for.
Unfortunately "scaling the masters" isn't how Elasticsearch expects to grow. There's effectively one master node, and the other master-eligible nodes are there to take its place in case of failure. Sharing the cluster state between distinct masters is effectively creating multiple clusters. That might be worth investigating further, particularly when using cross-cluster search to search across the clusters.
Some ideas (which you may already be doing/have tried):
Don't create more than one shard per index.
Don't create new indices so frequently. Perhaps weekly indices rather than daily?
Don't create distinct indices for each customer.
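To get a feel for how much each of these ideas could reduce the shard count, here is a rough back-of-the-envelope sketch. It assumes one primary shard and one replica per index, which the thread does not confirm, so the numbers are illustrative and will not match the reported 50k exactly. The function name `estimate_shards` and its parameters are made up for this example.

```python
import math

def estimate_shards(customers, retention_days, days_per_index,
                    primaries=1, replicas=1):
    """Rough shard count: indices per customer x customers x shard copies.

    primaries=1 and replicas=1 are assumptions, not values from the thread.
    """
    indices_per_customer = math.ceil(retention_days / days_per_index)
    return customers * indices_per_customer * primaries * (1 + replicas)

# Compare index granularities for 100 customers and one year of retention.
for label, span in [("daily", 1), ("weekly", 7), ("monthly", 30)]:
    total = estimate_shards(customers=100, retention_days=365, days_per_index=span)
    print(f"{label:8s} {total}")

# A single shared index per day (no per-customer split) is the customers=1 case.
print("shared daily", estimate_shards(customers=1, retention_days=365, days_per_index=1))
```

Under these assumptions, moving from daily to weekly indices cuts the shard count by roughly a factor of seven, and dropping the per-customer split cuts it by the number of customers.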
Why do you need to delete documents with monthly indexes? Can you not just leave them there until the whole month has expired? Also what performance problem are you seeing with document deletion and subsequent merges? This should happen in the background.
We sometimes need to update older months with recalculated data, which causes merges. Our experience has been that we had 'merge storms' which increased cluster load to the extent that query performance was affected.
Though the problem could have been that we started off at the other extreme, with only 12 indexes and several hundred huge shards. So then we went to the opposite extreme of 'immutable' daily indexes. Now we're going to move to bi-weekly indexes per customer. Hope it's finally the right balance.
Also, thanks for explaining the master nodes thing. Another feature which sounds helpful to me would be shareable mappings/settings for multiple indexes. When we change a mapping we want the change to affect all indexes in the cluster, so it's just a waste of space/effort for us to have individual mappings stored in the master state for each index.
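The closest existing mechanism to a shared mapping is probably an index template, which applies one mapping and settings definition to every new index whose name matches a pattern. The sketch below only builds the request body that would be sent with `PUT _template/<name>`; the pattern `customer-*`, the type name `doc` and the `timestamp` field are hypothetical. Note that, as far as I know, a template is applied only at index creation time and each index still keeps its own copy of the mapping in the cluster state, so this saves effort rather than master memory.

```python
import json

def build_template_body(pattern, mappings, legacy=True):
    """Build an index template body.

    ES 5.x uses a single "template" pattern string; 6.0 and later use an
    "index_patterns" list. The legacy flag switches between the two.
    """
    if legacy:
        body = {"template": pattern}
    else:
        body = {"index_patterns": [pattern]}
    body["settings"] = {"number_of_shards": 1}
    body["mappings"] = mappings
    return body

# Hypothetical mapping shared by all per-customer indexes.
shared_mappings = {"doc": {"properties": {"timestamp": {"type": "date"}}}}

print(json.dumps(build_template_body("customer-*", shared_mappings), indent=2))
```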
Thanks for the insights! I'll try to address them with our thoughts:
max_thread_count setting for merge
We did play with this; I think we tried setting it to 1 to slow down the merges. It didn't help enough.
It's not so applicable for us, since older months' data is also updated.
index-per-customer short version: We need this architecture so that we can decide per customer how many months of data to keep. Also, we did a POC of a large multi-customer index and queries were slightly slower.
long version: Originally our index was on ES 1.7 with routing by customer. The routing caused our shards to be highly unbalanced: some were small and some were 150 GB. It's likely that this caused the merging issues.
When we migrated to ES 5 we still had merge issues, but we were advised that routing isn't needed anymore. So we checked different options. We tried the original multi-customer monthly index, just without routing, and also daily indexes per customer (with per-customer aliases joining them all).
The daily index performed much better on queries.
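For anyone wondering what the per-customer aliases look like, here is a minimal sketch of the request body for `POST /_aliases`. The index naming scheme and the `-all` alias suffix are made up for illustration and are not our real names.

```python
def build_alias_actions(customer, index_names):
    """Build an _aliases body that points one alias at all of a customer's indexes."""
    alias = f"{customer}-all"  # hypothetical alias naming convention
    return {
        "actions": [
            {"add": {"index": name, "alias": alias}} for name in index_names
        ]
    }

# Example with two hypothetical daily indexes for one customer.
body = build_alias_actions("acme", ["acme-2018.06.01", "acme-2018.06.02"])
print(body)
```

Queries then go to the alias, so the client never needs to know how many indexes sit behind it or at what granularity they were created.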
Now that we've hit the too-many-shards issue, I think the right balance is monthly indexes per customer.
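A rough sketch of how monthly per-customer indexes could still support per-customer retention: name each index by customer and month, then pick whole indexes to delete once they fall outside that customer's retention window. The `customer-YYYY.MM` naming scheme, the function names and the retention values are all assumptions for illustration.

```python
from datetime import date

def monthly_index_name(customer, day):
    """Index name for the month containing `day`, e.g. acme-2018.06."""
    return f"{customer}-{day:%Y.%m}"

def expired_indices(index_names, retention_months, today):
    """Return index names older than the customer's retention window.

    The current month counts as the first retained month, so a retention
    of 3 keeps the current month and the two before it. Customers without
    a retention entry are skipped.
    """
    current = today.year * 12 + (today.month - 1)
    expired = []
    for name in index_names:
        customer, year_month = name.rsplit("-", 1)
        months = retention_months.get(customer)
        if months is None:
            continue
        year, month = (int(part) for part in year_month.split("."))
        age = current - (year * 12 + (month - 1))
        if age >= months:
            expired.append(name)
    return expired

names = ["acme-2018.06", "acme-2018.03", "globex-2018.03", "globex-2017.01"]
retention = {"acme": 3, "globex": 12}  # months to keep, per customer
print(expired_indices(names, retention, today=date(2018, 6, 15)))
```

The returned names could then be passed to the delete index API, so expiry stays a cheap whole-index delete rather than a document delete.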