Unbalanced shards within index

I have an index that was written to in the past but is now only being used for searches (this has actually happened with multiple indices). At one point I noticed that this index in particular was acting sluggish; not just searches, all APIs were taking a long time to return. Digging into it, I discovered that its shards were "out of balance". Several primary/replica sets were much, much larger than the rest of the index's shards, and even within each of those sets the primary and replica sizes varied by a large degree (everything included here comes from a single index):

shard prirep state      docs    store 
7     p      STARTED 4113172  636.2mb 
7     r      STARTED 4113172  630.7mb 
7     r      STARTED 4113172    631mb 
3     r      STARTED 4005369  614.4mb 
3     p      STARTED 4005369  618.8mb 
3     r      STARTED 4005369  614.4mb 
6     p      STARTED 4083918  630.5mb 
6     r      STARTED 4083918  624.9mb 
6     r      STARTED 4083918  625.3mb 
8     r      STARTED 4023651  619.6mb 
8     p      STARTED 4023651  619.6mb 
8     r      STARTED 4023651  614.6mb 
4     p      STARTED 3940152    609mb 
4     r      STARTED 3940152    604mb 
4     r      STARTED 3940152    604mb 
9     r      STARTED 4103379    4.6gb 
9     p      STARTED 4103699    2.5gb 
9     r      STARTED 4103379    5.8gb 
2     r      STARTED 3977642    5.5gb 
2     r      STARTED 3977642    5.3gb 
2     p      STARTED 3977967    6.1gb 
5     r      STARTED 3946614  606.4mb 
5     r      STARTED 3946614  605.9mb 
5     p      STARTED 3946614  610.9mb 
1     r      STARTED 6643187 1011.8mb 
1     r      STARTED 6643187 1013.1mb 
1     p      STARTED 6643187 1015.8mb 
0     r      STARTED 6615424 1006.7mb 
0     r      STARTED 6615424 1006.7mb 
0     p      STARTED 6615424 1008.5mb 

I searched for this problem online, and one of the recommendations was to run a force merge, which I did, but it did not correct the imbalance; neither did forcing an index refresh. I also used the stats APIs on the index/shards to look for documents marked for deletion (I did not find any, but I may not have been looking in exactly the right spot) or open search contexts that might be holding on to files, but that also came up empty.
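
The calls were roughly the following (the index name is a placeholder; the max_num_segments=1 reflects the fact that each shard ended up with a single segment):

```
# force merge and refresh
POST my-index/_forcemerge?max_num_segments=1
POST my-index/_refresh

# per-shard stats; docs.deleted and search.open_contexts are in here
GET my-index/_stats?level=shards
```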

Then I tried digging into the individual segments (keep in mind that at this point every shard has only one segment):

shard prirep segment generation docs.count docs.deleted     size size.memory committed searchable version compound
0     r      _l1            757    6615424            0 1006.7mb     2261123 true      true       7.2.1   false
0     p      _zj           1279    6615424            0 1008.5mb     2260110 true      true       7.2.1   false
0     r      _n6            834    6615424            0 1006.7mb     2261883 true      true       7.2.1   false
1     r      _js            712    6643187            0 1011.8mb     2266053 true      true       7.2.1   false
1     p      _ld            769    6643187            0 1015.8mb     2264321 true      true       7.2.1   false
1     r      _mf            807    6643187            0 1013.1mb     2266022 true      true       7.2.1   false
2     r      _82d         10453    3977642            0  610.1mb     1423220 true      true       7.2.1   false
2     r      _7xt         10289    3977642            0  610.7mb     1424208 true      true       7.2.1   false
2     p      _96l         11901    3977967            0  611.6mb     1420791 true      true       7.2.1   false
3     r      _bk            416    4005369            0  614.4mb     1430751 true      true       7.2.1   false
3     p      _fk            560    4005369            0  618.8mb     1428885 true      true       7.2.1   false
3     r      _bl            417    4005369            0  614.4mb     1430441 true      true       7.2.1   false
4     r      _cv            463    3940152            0    604mb     1414100 true      true       7.2.1   false
4     p      _he            626    3940152            0    609mb     1413843 true      true       7.2.1   false
4     r      _c7            439    3940152            0    604mb     1413732 true      true       7.2.1   false
5     r      _bv            427    3946614            0  605.9mb     1411027 true      true       7.2.1   false
5     r      _e5            509    3946614            0  606.4mb     1410594 true      true       7.2.1   false
5     p      _hr            639    3946614            0  610.9mb     1409900 true      true       7.2.1   false
6     r      _cv            463    4083918            0  624.9mb     1452965 true      true       7.2.1   false
6     r      _ee            518    4083918            0  625.3mb     1453792 true      true       7.2.1   false
6     p      _jj            703    4083918            0  630.5mb     1451734 true      true       7.2.1   false
7     r      _cl            453    4113172            0  630.7mb     1464278 true      true       7.2.1   false
7     r      _hx            645    4113172            0  636.2mb     1462899 true      true       7.2.1   false
7     p      _hx            645    4113172            0  636.2mb     1462899 true      true       7.2.1   false
8     r      _ca            442    4023651            0  614.6mb     1437393 true      true       7.2.1   false
8     r      _h8            620    4023651            0  619.6mb     1436643 true      true       7.2.1   false
8     p      _h8            620    4023651            0  619.6mb     1436643 true      true       7.2.1   false
9     r      _4h9          5805    4103379            0  628.8mb     1461421 true      true       7.2.1   false
9     r      _8g8         10952    4103379            0    628mb     1463400 true      true       7.2.1   false
9     p      _h4            616    4103699            0  633.9mb     1461606 true      true       7.2.1   false

(Segment _8g8 belongs to the largest replica of shard 9)
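
For reference, a cat segments request along these lines would reproduce the segments listing (placeholder index name):

```
GET _cat/segments/my-index?v
```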

Interestingly enough, each of the oversized shards had a segment with a normal doc count, storage usage, and memory footprint. The only strange thing was that the segment size for those shards was much lower than the shard size; it's as if the shard is holding onto a hidden segment. Also interesting is that the segments belonging to the oversized shards were at a much, much higher generation than the other segments, which makes me think those shards got into a bad state where they simply lost track of large chunks of data that should have been deleted.

Aside from reindexing (which I assume will work), I am interested in discovering a way to get out of this state where these bad shards are slowing down the whole index. I am also interested in learning how we got into this state in the first place.

Just to be clear, you're worried about/looking at the GB size?

I am, yes. Shards 2 and 9 in particular jump out at me as being off.

What sort of data is it? What's sending it to Elasticsearch? Are you doing anything with custom IDs or routing?

It is structured JSON data being fed into Elasticsearch via bulk inserts. We are not doing any custom routing.

Also, the index in question is a one-hour slice of a series of time-based indices. The other indices in the series do not have this problem; their sizes tend to be very consistent.

Given that the document counts of the shards are relatively close, there may be some overly large documents in there.

However, I'd also suggest that 10 shards for this amount of data is probably too many anyway.

I'm not sure how that could happen in our pipeline, but I'd be interested in exploring it. Is there a particularly good way to root out abnormal doc sizes?

Also, large docs could explain why one set of shards is larger than another, but would it explain why there is so much variance between the replicas and the primary within that set? In my example above, one replica is more than twice the size of its primary. If the issue were large docs, primary and replica sizes should still be in sync.

I agree. This index is actually abnormally small; other indices in this time series tend to have shard sizes around 2.5GB. I suspect even that is smaller than optimal, and I am in the process of exploring what our ideal shard size is (it might not be tens of GBs, because we have thousands of shards living on dozens of nodes). On that tangent... I know the general wisdom on shard size is that it depends on so many factors that a generic answer is not possible, but if you have any more thoughts on the matter, I would be very interested.

Not without the mapper-size plugin being installed, which is a bit late now, unfortunately.

That's what it looked like. It's something I may start doing now to help next time.
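
From what I can tell, it would just be a matter of enabling the plugin's `_size` metafield when each index is created, something like the following (index and type names are placeholders, and the plugin has to be installed on every node first):

```
PUT my-index
{
  "mappings": {
    "_doc": {
      "_size": {
        "enabled": true
      }
    }
  }
}
```

A search sorted on `_size` descending should then surface the biggest documents.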

As for next steps: if I find some really big documents in those shards (I am not expecting to), that may point to an ingestion problem. Maybe data that was meant to be split between several documents was somehow crammed into one? That could explain the abnormally large shards, but (in my mind) it would not necessarily explain why the primaries and replicas are such different sizes.

If I do not find over-sized documents, does that mean the data behind the large size was essentially "leaked"? Possibly data that was waiting to be flushed during ingestion, or that was created as part of shard balancing, and was then just never cleaned up?

Both options sound like a bug in ES, but it's possible that it is just a case of our system doing something incorrectly and ES simply not handling it properly. As for what might be causing the issue in the first place, I'd be very interested in hearing any theories that I could then try to replicate on my end. Could I make this happen with an exceptionally large insert? Could I reproduce it with malformed documents, interrupted network connections, or by toggling the cluster's ability to allocate shards? There are any number of possibilities, so even a short list of likely causes would be very helpful.

In my mind, an important clue here is the generation numbers for those shards... what if shard allocation was disabled partway through index creation? Or what if the nodes housing all but those two shards became inaccessible right as ingestion began?

On the other side of things, I'd also be very interested in hearing suggestions on what I might be able to do to resolve this imbalance (short of reindexing, which is resource intensive). Either those shards are holding onto a bigger portion of the data than they should be, or they are holding onto "bad" data. Is there any way for me to force the shards to balance themselves, or to force some sort of cleanup? No matter what caused the initial failure, something is clearly wrong with this index, if for no other reason than that primaries and replicas should be the same size. I'm hoping there is a way to make ES realize something is wrong.

Unlikely.

You could try exporting the data from the index into a file and then using some tools to pull it apart to see if there are any overly large documents.
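
For example, a scroll search is one way to pull everything out for offline inspection (placeholder index name):

```
POST my-index/_search?scroll=1m
{
  "size": 1000,
  "sort": ["_doc"]
}
```

Keep calling `POST _search/scroll` with the returned `_scroll_id` until no more hits come back, write each page to a file, and then look at the JSON size of each document.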

You could drop the replicas and then re-add them; that will force the primary and replicas to be resynced.
For the shard size itself, you'd need to reindex, and even that won't necessarily fix things.
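
A sketch of the replica drop/re-add, assuming the two replicas per shard shown in the cat output above (placeholder index name):

```
PUT my-index/_settings
{
  "index": { "number_of_replicas": 0 }
}

PUT my-index/_settings
{
  "index": { "number_of_replicas": 2 }
}
```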

One piece of info I can't see in here that might be useful is the versions of the things you are running: OS, JVM, Elasticsearch.

Elasticsearch: 6.2
OS: Windows
File system: NTFS
JVM: 1.8.0_102

Unfortunately, I can't really say for sure whether this was happening before we moved to ES 6. What I can say, and probably should have added from the start, is that I think this index jumbling primarily happens at times when the cluster is going through some mild trauma: nodes may be coming in and out, allocation may be disabled and re-enabled, and ingestion into the cluster may be spiking or even being throttled.

... Once I've milked this index for all it's worth, I may experiment with it to see what happens when I remove and re-add the replicas, and then verify what I've been assuming: that balance will be restored after a reindex. It would be interesting if the reindex did not resolve the problem.

I wonder if the extra space usage is due to translog retention. #30244 fixes an issue which could result in this. You can tell by looking at GET _stats?level=shards and checking whether the translog of the overlarge shards is unreasonably large, or whether there is an unreasonably large difference between the global checkpoint and the maximum sequence number. If so, the answer is to upgrade to 6.3.0.
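
For example, a filter_path should narrow the output to just those pieces, assuming the seq_no section is reported at the shard level on your version (placeholder index name):

```
GET my-index/_stats?level=shards&filter_path=indices.*.shards.*.translog,indices.*.shards.*.seq_no
```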

So, the index I was looking at before has already been cleaned up, but this happens often enough that I was able to find another instance of it... and I think you've hit the nail on the head. The index in question has not been written to for over a day now, it has a shard that is several times larger than it should be (with a lot of variance between the primary and replicas of that one shard), and for that shard the "translog" metrics appear to be off the charts:

"translog": {
              "operations": 6697544,
              "size_in_bytes": 36371585511,
              "uncommitted_operations": 6635577,
              "uncommitted_size_in_bytes": 36035118480
            }

A normal shard in that same index looks like this:

"translog": {
              "operations": 0,
              "size_in_bytes": 43,
              "uncommitted_operations": 0,
              "uncommitted_size_in_bytes": 43
            }

So, this points to us needing to upgrade to 6.3.0?

A second question:

It looks to me like the PR you referenced specifically addresses an issue where mapping updates are applied in the middle of processing a bulk operation. As far as I know, that would not happen with us on a regular basis. Should I assume that it happens more often than I'd expect (and look into why)? Or am I just not thinking about it the right way, and it is simply a normal part of operations?

That's the first thing I'd try in this situation.

If you have disabled dynamic mappings entirely then it's not this specific issue, but it only takes one problematic mapping update to trigger it. Moving to 6.3.0 will at least rule this out before we dig into it any further.
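
One quick sanity check is to pull the current mapping and see whether `dynamic` is disabled and whether any fields you never defined explicitly have crept in (placeholder index name):

```
GET my-index/_mapping
```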
