Index.priority setting not working for replicas

I'm on 6.7.0. When my cluster recovers, replicas of indices with NO index.priority setting are recovered before replicas of indices that have index.priority set. Is this a known issue? Should I just set a priority on all indices? This causes problems when I try to change documents in the index that hasn't recovered yet.


Index priorities aren't particularly strict for replicas, because Elasticsearch can only start to allocate replicas once the primary is allocated, so if a low-priority primary is started quickly then its replicas might start to be recovered before the higher-priority-but-slower ones.

Can you explain in more detail how this is an issue for you? You should be able to write to an index as soon as its primaries have started.

The index I'm trying to write to is tiny (1 shard, 3 replicas, < 1MB) and contains sequence numbers. A simple POST to replace the document times out after 60 seconds. I'll have to research why, but as soon as that index recovers things start working. Meanwhile Elasticsearch is busy recovering old time-based indices that are neither in use nor important (multiple shards, multiple GB each). It looks like it is recovering indices in reverse alphabetical order.

The 60-second-wait-then-fail sounds like the behaviour you get when the primary isn't allocated. Elasticsearch limits the number of recovering shards (primaries and replicas), and perhaps this throttling mechanism is causing this. There are other reasons why the index priority isn't a strict order even for primaries.

Can you reproduce this with DEBUG?

Also, what is the cluster recovering from?

This index has the highest priority and it was in yellow state, in fact the whole cluster was in yellow state. The primary had been recovered but the 3 replicas had not. In this case it was from about a 75% cluster bounce.

On which node does DEBUG have to be set? All of them? Does that require a bounce, or can it be set with _cluster/settings?

Really it's just the master that cares about this, but it's probably simplest to set it through the API:

PUT _cluster/settings
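
The request body did not survive in the post above. As a rough sketch only, a transient logger setting could be built like this. The logger name `org.elasticsearch.gateway` is an assumption, inferred from the `o.e.g.G.InternalReplicaShardAllocator` logger that appears in the log line quoted further down the thread, and `debug_logging_request` is a hypothetical helper that just builds the request without contacting a cluster.

```python
import json


def debug_logging_request(logger="org.elasticsearch.gateway", level="DEBUG"):
    """Build (method, path, body) for a transient logger-level change.

    The default logger name is an assumption, not taken from the original post.
    """
    # Logger levels are set through cluster settings keys of the form "logger.<name>".
    body = {"transient": {f"logger.{logger}": level}}
    return "PUT", "_cluster/settings", body


method, path, body = debug_logging_request()
print(method, path)
print(json.dumps(body, indent=2))
```

Sending the same key with a `null` value afterwards should restore the default log level.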

OK, I did another bounce and the cluster is at yellow. I just see these types of messages over and over, and there are no entries for my sequence index, only for the time-based sessions2* indices.

[2019-04-23T21:45:27,292][DEBUG][o.e.g.G.InternalReplicaShardAllocator] [es-02a] [[sessions2-190328h13/xBLyKANjSa-rpvQfVqkcnw]][1]: allocating [[sessions2-190328h13][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2019-04-23T20:45:31.329Z], delayed=false, allocation_status[no_attempt]]] to [{es-51}{NSWdv8rjS_6Jh9WHl2dYow}{_5uJzM8DTfaF1a0KG4ayng}{es-51.domain}{}{rack_id=moloches51}] in order to reuse its unallocated persistent store

Here is what the current health looks like

epoch      timestamp cluster status node.total node.data shards pri   relo init unassign pending_tasks max_task_wait_time active_shards_percent
1556056252 21:50:52  Moloch  yellow         97        89  41166 39854    0  429    38013           537                52m                 51.7%

Not sure if related, but I have several other indices that are in this state where they are the last to recover, and I just noticed they all use auto_expand_replicas. Example:

   "settings": {
      "index.priority": 100,
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "auto_expand_replicas": "0-3"

Ok, this says yellow which tells us that all the primaries are assigned, and therefore it should be possible to write into every index. Just to be clear, you're saying that you're still unable to write into one of your indices even in this state?

The log message you quoted is about allocating a replica, which shouldn't be important; I was hoping to see messages about the allocation of the primary of the problematic index, to indicate whether it happens after the allocations of all these other replicas.

When your write fails, what is the response? Are there any corresponding exceptions in the log at the same time?

I have something of a theory about what's happening here. If you try and write to a yellow index then Elasticsearch will write to all the available copies and will also mark any unavailable in-sync copies as out-of-sync. Marking a shard as out-of-sync needs the master to update the cluster state.

However, you have ~80,000 shards in this cluster, which is quite a lot, and this means that starting each batch of shards takes quite some time on the master. In part this is because when each batch starts the master must calculate the next set of recoveries to start, which is a little time-consuming with so many shards in play. Finally, starting shards runs at a higher priority than marking them as out-of-sync, so I suspect that the master is too busy with the recoveries to get around to the out-of-sync thing until the write has timed out.
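
The ~80,000 figure can be checked against the health output quoted earlier in the thread: active, initializing and unassigned shards together make up the total. A quick sketch of that arithmetic, with variable names chosen only for illustration:

```python
# Shard counts copied from the health output quoted earlier in the thread.
active = 41166        # active shards (primaries and replicas)
initializing = 429    # shards currently being recovered
unassigned = 38013    # shards still waiting to be allocated

total_shards = active + initializing + unassigned
active_percent = 100 * active / total_shards

print(total_shards)
print(f"{active_percent:.1f}%")
```

The percentage computed this way agrees with the 51.7% reported in the health output, so roughly half of the shards were still waiting when the write timed out.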

The recommendation for a full cluster restart is to wait for the cluster to report green health before starting to use it. If the theory above is right then a possible workaround would be to restart your nodes with cluster.routing.allocation.enable: primaries as per the full cluster restart docs, write once to the problematic index to mark the non-primaries as out-of-sync, and then re-enable allocation. I think a more general solution would be to split this enormous cluster into a larger number of smaller and more manageable clusters.
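
As a hedged sketch of that workaround, the three steps could be laid out as plain request descriptions. `allocation_setting` and `workaround_steps` are hypothetical helpers, the example write is a placeholder for whatever request the application normally issues, and nothing here talks to a cluster. The persistent scope is an assumption made so that the setting survives the node restarts.

```python
import json


def allocation_setting(value):
    """Body for PUT _cluster/settings that sets cluster.routing.allocation.enable."""
    # Passing None serialises to JSON null, which resets the setting to its default.
    return {"persistent": {"cluster.routing.allocation.enable": value}}


def workaround_steps(write_request):
    """Return the ordered (method, path, body) requests for the workaround."""
    return [
        # 1. Before restarting the nodes: only primaries may be allocated.
        ("PUT", "_cluster/settings", allocation_setting("primaries")),
        # 2. Once primaries are up: one write marks the missing copies out-of-sync.
        write_request,
        # 3. Re-enable allocation so the replicas can recover.
        ("PUT", "_cluster/settings", allocation_setting(None)),
    ]


# Hypothetical write; the path and document are placeholders.
example_write = ("PUT", "sequence_v2/_doc/example", {"count": 1})
for method, path, body in workaround_steps(example_write):
    print(method, path, json.dumps(body))
```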

Yes, your theory is pretty much what I think is happening. That's why I was hoping Elasticsearch could be made to honor the priority for replicas, or have a setting that does that. Or a setting that, instead of whatever method it uses now to pick which to complete first, goes smallest to largest?

It is puzzling that it's taking so long to get this index to green. "Smallest to largest" wouldn't behave significantly differently from the existing priority order so I don't think that'd help. Is your cluster back to green yet? Can you share the debug logs you collected?

Yep we are green now. Sure can send logs, directions to how to send privately?

I was suggesting smallest to largest because these are my smallest shards/indices. Or are you saying even if this index went green, I still wouldn't have been able to write to it?
1M/1shard/3rep vs 2T/40 shard/1rep

We are going to try reducing the number of shards some which should help the master out.

If they fit in an email then you can send them to; if too large for that I can sort out an alternative. It might be useful for me to quote some things back to you here; if there's things you'd like me to redact when doing so then let me know.

No, if my theory is right then there'd be no problem writing to this index once green. The issue with smallest-to-largest is that we already prefer this small index based on its priority, so adding some other priority rule isn't obviously going to help.

Thanks for the logs. I think I see what's going on. There are two other things that trump the index priority when it comes to allocation order:

  1. primary allocation is more important than replica allocation
  2. allocating replicas to nodes that share segments with the primary is more important than allocating other replicas

Since you have a lot of shards, it takes quite some time (~50 minutes) to even start the primaries. I then can't see any messages about the replicas for sequence_v2 being allocated by the InternalReplicaShardAllocator, but this allocator only works on replicas that share segments with the primary, so this suggests that the replicas for sequence_v2 aren't in this category. That'd be the case if, for instance, this index sees a lot of updates but doesn't see very many new documents being added.
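
The ordering described above can be pictured as a sort with a three-part key. This is a simplified model for illustration, not Elasticsearch's actual allocator code: the `Shard` class, the `reusable_store` flag and the priority values are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    index: str
    primary: bool
    reusable_store: bool  # replica has an on-disk copy sharing segments with the primary
    priority: int         # index.priority (illustrative values)


def allocation_key(shard):
    # Smaller tuples sort first:
    #   1. primaries before replicas
    #   2. replicas that can reuse an existing store before other replicas
    #   3. higher index.priority before lower
    is_replica = 0 if shard.primary else 1
    needs_full_copy = 0 if shard.primary or shard.reusable_store else 1
    return (is_replica, needs_full_copy, -shard.priority)


shards = [
    Shard("sequence_v2", primary=False, reusable_store=False, priority=100),
    Shard("sessions2-190328h13", primary=False, reusable_store=True, priority=1),
    Shard("sessions2-190328h13", primary=True, reusable_store=False, priority=1),
    Shard("sequence_v2", primary=True, reusable_store=False, priority=100),
]

for shard in sorted(shards, key=allocation_key):
    print(shard.index, "primary" if shard.primary else "replica")
```

In this model the sequence_v2 replica sorts last despite its higher priority, because the priority is only consulted after the other two criteria.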

The reason for this second priority is somewhat historical, and we are working on removing it.

Ok thanks! Any github issue I can track?

No, we're tracking this internally but there's no issue open for it yet.
