Closing indices by itself

Hi

Our cluster has started going into RED state every few minutes.
The log doesn't say anything about why it goes RED, but judging from the output below, it seems to be closing and opening indices by itself, for some unknown reason.

It should be noted that we're not at all under memory or CPU pressure, but this cluster does have 2029 indices with a total of 8100 shards (2p+1r), split across 5 nodes.

[2018-07-12T21:18:03,978][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] closing indices [[[index1/rXe9XZwkQ2iXIUrf5FJLzQ]]]
[2018-07-12T21:18:04,609][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] closing indices [[[index2/81rY1m69RMWW6FIGsPDrug]]]
[2018-07-12T21:18:06,436][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] opening indices [[[index1/rXe9XZwkQ2iXIUrf5FJLzQ]]]
[2018-07-12T21:18:07,109][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] opening indices [[[index2/81rY1m69RMWW6FIGsPDrug]]]
[2018-07-12T21:18:09,967][INFO ][o.e.c.r.a.AllocationService] [Prod-04] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index2][0], [index2][1]] ...]).
[2018-07-12T21:18:10,046][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][0] received shard failed for shard id [[index2][0]], allocation id [9_YF-DRLSWq4Lx-FpX3shg], primary term [3], message [mark copy as stale]
[2018-07-12T21:18:10,118][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] closing indices [[[index3/RbneN50ZTSC762qnkY_j4A]]]
[2018-07-12T21:18:11,072][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:11,191][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][0] received shard failed for shard id [[index2][0]], allocation id [9_YF-DRLSWq4Lx-FpX3shg], primary term [3], message [mark copy as stale]
[2018-07-12T21:18:11,259][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:11,568][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:11,580][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:11,717][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:12,028][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:12,094][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:12,107][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:12,232][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:12,271][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:12,337][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:12,533][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:12,581][WARN ][o.e.c.a.s.ShardStateAction] [Prod-04] [index2][1] received shard failed for shard id [[index2][1]], allocation id [-9PO9zR2RaK2P1QnIsHtmw], primary term [4], message [mark copy as stale]
[2018-07-12T21:18:12,651][INFO ][o.e.c.r.a.AllocationService] [Prod-04] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index2][1]] ...]).
[2018-07-12T21:18:12,929][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] opening indices [[[index3/RbneN50ZTSC762qnkY_j4A]]]
[2018-07-12T21:18:14,743][INFO ][o.e.c.r.a.AllocationService] [Prod-04] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index3][1], [index3][0]] ...]).
[2018-07-12T21:18:15,977][INFO ][o.e.c.r.a.AllocationService] [Prod-04] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index3][0], [index3][1]] ...]).
[2018-07-12T21:19:00,676][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] closing indices [[[index4/I_GucK5USTq14yLAjeuGTQ]]]
[2018-07-12T21:19:02,120][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] opening indices [[[index4/I_GucK5USTq14yLAjeuGTQ]]]
[2018-07-12T21:19:04,762][INFO ][o.e.c.r.a.AllocationService] [Prod-04] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index4][1]] ...]).
[2018-07-12T21:19:06,953][INFO ][o.e.c.r.a.AllocationService] [Prod-04] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index4][1]] ...]).
[2018-07-12T21:21:00,871][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] closing indices [[[index5/yM15ExujQq-69pEqGdDTtQ]]]
[2018-07-12T21:21:02,517][INFO ][o.e.c.m.MetaDataIndexStateService] [Prod-04] opening indices [[[index5/yM15ExujQq-69pEqGdDTtQ]]]
[2018-07-12T21:21:04,503][INFO ][o.e.c.r.a.AllocationService] [Prod-04] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index5][1], [index5][0]] ...]).
[2018-07-12T21:21:06,431][INFO ][o.e.c.r.a.AllocationService] [Prod-04] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index5][0]] ...]).

Any suggestions as to what is causing this?

Indices don't just open and close themselves, so something must be calling the API to do this.
You may want to enable debug logging to see if you can catch what is making the calls, or enable X-Pack Security to track and stop it.
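
For example, something along these lines should surface more detail from the service that writes those open/close lines (a sketch only; the logger name and the localhost:9200 endpoint are assumptions you'd adapt to your setup):

# Raise logging for the cluster metadata service (o.e.c.m.*) that emits the closing/opening messages
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.cluster.metadata": "DEBUG"
  }
}'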

That shard count is overloading the cluster; you should look at reducing it ASAP.

Hi, warkolm

You made us go back and look at the code again, and there does indeed seem to be a common code path that ends up opening and closing indices, so thank you for pointing out that ES cannot do that by itself.
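
For reference, that code path ultimately issues the equivalent of the close/open index APIs, along these lines (index name and endpoint are just examples):

# Close and then reopen an index - this is what produces the MetaDataIndexStateService log lines above
curl -XPOST 'http://localhost:9200/index1/_close'
curl -XPOST 'http://localhost:9200/index1/_open'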

So to follow up on your suggestion of bringing down the shard count:
As we're running a multi-tenancy setup with an index per tenant, we need relevance for each tenant to be as good as possible, which, IIRC, is closely tied to the term stats of each index. If I start merging more tenants into the same indices, they would also share term statistics, which would make algorithms like TF/IDF give less relevant results.
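
To make that concrete, as far as I understand, Lucene's BM25 computes the IDF part of the score from per-shard statistics, roughly:

idf(t) = log(1 + (docCount - docFreq(t) + 0.5) / (docFreq(t) + 0.5))

so if two tenants share an index, one tenant's documents change docCount and docFreq(t) for the other, and scores shift accordingly.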

Any way to solve this?

Can you reduce your primary count?

Good suggestion.

I could, and I probably will. It's currently 2p+1r; I could bring that down to 1p+1r, even though query performance may suffer a little from reduced query parallelism. I'd have to test how much actual impact that has.
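
If we do go that route, my understanding is that the _shrink API should get an existing 2-primary index down to 1 primary without a full reindex, roughly like this (index and node names are placeholders, and I'd double-check the steps against the docs first):

# 1) Make the index read-only and co-locate all shard copies on one node (required for shrink)
curl -XPUT 'http://localhost:9200/tenant-index/_settings' -H 'Content-Type: application/json' -d '
{ "index.routing.allocation.require._name": "one-of-our-data-nodes", "index.blocks.write": true }'

# 2) Shrink into a new index with a single primary
curl -XPOST 'http://localhost:9200/tenant-index/_shrink/tenant-index-1p' -H 'Content-Type: application/json' -d '
{ "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 1 } }'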

You don't know of a way to solve the per-tenant term stats issue, do you?
I guess that even if I bring down the number of primaries, I'd run into the issue again down the road as the system scales.

Is the solution then to split into multiple separate clusters? I guess that would bring down the size of the cluster state, but I'm not sure if the size of that is actually an issue at all.

You could use routing, so that all of a user's docs end up in the same shard for scoring. But I think that may create more issues, as you'd potentially need an index with a lot of shards.
Or add more nodes, ideally aim for <600 shards per node.
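
For the routing option, that would look roughly like this (index name, tenant ID and endpoint are only examples; you still need a tenant filter in the query, since routing only selects the shard):

# Index a tenant's document with a routing value so all of that tenant's docs land on one shard
curl -XPUT 'http://localhost:9200/shared-index/_doc/1?routing=tenant42' -H 'Content-Type: application/json' -d '
{ "tenant_id": "tenant42", "title": "example doc" }'

# Search only that shard, and still filter by tenant to exclude other tenants sharing it
curl -XGET 'http://localhost:9200/shared-index/_search?routing=tenant42' -H 'Content-Type: application/json' -d '
{ "query": { "bool": { "filter": [ { "term": { "tenant_id": "tenant42" } } ], "must": [ { "match": { "title": "example" } } ] } } }'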

Ultimately, if you aren't suffering resourcing issues then it may be a moot point for your own benefit/comfort, but that per-node shard count is higher than we recommend.
