Many indices and slow cluster state updates

Hi All,

we are about to learn (the hard way) that the index-per-user model is the wrong way to go.
Initially we planned for one index per user but with the announcement that types are going to be removed in ES6 we now ended up with 7 indices per user.
Overall we now have a 4-node cluster with 30k indices with overall 66k shards with additional indices being created all the time.
The only reason that this cluster is still alive is that we got some good SSD Raid0 setups in the node.

We now know that this is absolutely wrong and we are about to fix this by re-indexing the data to weekly indices.
But the cluster is about to reach the critical limit for cluster state refreshs to be acknowledged in time by all nodes.
Even though there are only diffs of the cluster state being communicated since, state updates always seem to include the full routing table (which is massive with 66k shards).

We are looking for a short term solution to get rid of some of this load without actually losing the data while we re-index the data into the new schema.
We only need to survive a few more days to prepare for the data model change.

Can anyone help us on how we could reduce the cluster load with regards to state refreshs and routing table comms?
Would it for example help if we we close a portion of the indices? Does that exclude them from the routing table?
Or any parameter that we could tweak?

Thanks and regards

I wonder if filtered aliases would not be a great fit in your case.

So basically you will have only 7 index in total, with x shards per index, let say 5 so you would end up with 70 shards (including replica) instead of 66k.

Then use routing=user_id at index and search time.
An alias named indexa_userid then would have a builtin filter like "term": { "user" : "userid" } and the routing key as well.


That sounds interesting to solve my datamodel problem but if I add routing to all queries I could also add a term filter in all my queries to achive the same thing (with the only difference that all my users data is spread across all nodes in the cluster which may or may not be an advantage).

Anyway I will checkout named aliases to see if they can help me.

But what I really need is a temporary solution to reduce the load on the cluster caused by the large routing table in order to survive a few more days until the data model change is implemented and the data re-indexed.

Routing and term filter are not the same.

Routing ensures that you will hit only the right shards which contains your data.
Filter will make sure that you only see what you need to see.

Basically a shard will contain data for users x, y, z. Routing will just execute the query on this shard. Filter will reduce the result set to only documents for user x.

In short: you must use the term filter. You can use the routing key for better performance.

But what I really need is a temporary solution to reduce the load on the cluster

It's hard TBH. I mean the only short term solution I can imagine from the top of my head is to add more nodes. May be you should start 3 "small" master only nodes.
Then stop the current master node. And restart it.

I'd disable allocation while doing that so you won't have tons of data moving.
And re-enable after the old master is back online.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.