I have an index that was written to in the past but is now only used for searches (this has actually happened with multiple indices). At one point I noticed that this index in particular was acting sluggish; not just searches, but all APIs were taking a long time to return. Digging into it, I discovered that all of its shards were "out of balance". There were multiple sets of p's/r's that were much, much larger than the rest of that index's shards, and even within each of those sets the primary and replica sizes varied by a large degree (everything included here comes from a single index):
shard prirep state docs store
7 p STARTED 4113172 636.2mb
7 r STARTED 4113172 630.7mb
7 r STARTED 4113172 631mb
3 r STARTED 4005369 614.4mb
3 p STARTED 4005369 618.8mb
3 r STARTED 4005369 614.4mb
6 p STARTED 4083918 630.5mb
6 r STARTED 4083918 624.9mb
6 r STARTED 4083918 625.3mb
8 r STARTED 4023651 619.6mb
8 p STARTED 4023651 619.6mb
8 r STARTED 4023651 614.6mb
4 p STARTED 3940152 609mb
4 r STARTED 3940152 604mb
4 r STARTED 3940152 604mb
9 r STARTED 4103379 4.6gb
9 p STARTED 4103699 2.5gb
9 r STARTED 4103379 5.8gb
2 r STARTED 3977642 5.5gb
2 r STARTED 3977642 5.3gb
2 p STARTED 3977967 6.1gb
5 r STARTED 3946614 606.4mb
5 r STARTED 3946614 605.9mb
5 p STARTED 3946614 610.9mb
1 r STARTED 6643187 1011.8mb
1 r STARTED 6643187 1013.1mb
1 p STARTED 6643187 1015.8mb
0 r STARTED 6615424 1006.7mb
0 r STARTED 6615424 1006.7mb
0 p STARTED 6615424 1008.5mb
I tried searching for this problem online, and one of the recommendations was to run a forcemerge, which I did, but it did not correct the imbalance; neither did forcing an index refresh. I also used the API to gather stats on the index/shards, looking for documents marked for deletion (I did not find any, but I may not have been looking in exactly the right spot) or open search contexts that might be holding on to files, but that also came up empty.
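Roughly the calls involved, for reference (index name is a placeholder and the exact parameters may have differed); the two stats calls are where I looked for docs.deleted and search.open_contexts per shard:
POST /my-index/_forcemerge?max_num_segments=1
POST /my-index/_refresh
GET /my-index/_stats/docs,store?level=shards
GET /my-index/_stats/search?level=shards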
Then I tried digging into the individual segments (keep in mind that, at this point, every shard has only one segment):
shard prirep segment generation docs.count docs.deleted size size.memory committed searchable version compound
0 r _l1 757 6615424 0 1006.7mb 2261123 true true 7.2.1 false
0 p _zj 1279 6615424 0 1008.5mb 2260110 true true 7.2.1 false
0 r _n6 834 6615424 0 1006.7mb 2261883 true true 7.2.1 false
1 r _js 712 6643187 0 1011.8mb 2266053 true true 7.2.1 false
1 p _ld 769 6643187 0 1015.8mb 2264321 true true 7.2.1 false
1 r _mf 807 6643187 0 1013.1mb 2266022 true true 7.2.1 false
2 r _82d 10453 3977642 0 610.1mb 1423220 true true 7.2.1 false
2 r _7xt 10289 3977642 0 610.7mb 1424208 true true 7.2.1 false
2 p _96l 11901 3977967 0 611.6mb 1420791 true true 7.2.1 false
3 r _bk 416 4005369 0 614.4mb 1430751 true true 7.2.1 false
3 p _fk 560 4005369 0 618.8mb 1428885 true true 7.2.1 false
3 r _bl 417 4005369 0 614.4mb 1430441 true true 7.2.1 false
4 r _cv 463 3940152 0 604mb 1414100 true true 7.2.1 false
4 p _he 626 3940152 0 609mb 1413843 true true 7.2.1 false
4 r _c7 439 3940152 0 604mb 1413732 true true 7.2.1 false
5 r _bv 427 3946614 0 605.9mb 1411027 true true 7.2.1 false
5 r _e5 509 3946614 0 606.4mb 1410594 true true 7.2.1 false
5 p _hr 639 3946614 0 610.9mb 1409900 true true 7.2.1 false
6 r _cv 463 4083918 0 624.9mb 1452965 true true 7.2.1 false
6 r _ee 518 4083918 0 625.3mb 1453792 true true 7.2.1 false
6 p _jj 703 4083918 0 630.5mb 1451734 true true 7.2.1 false
7 r _cl 453 4113172 0 630.7mb 1464278 true true 7.2.1 false
7 r _hx 645 4113172 0 636.2mb 1462899 true true 7.2.1 false
7 p _hx 645 4113172 0 636.2mb 1462899 true true 7.2.1 false
8 r _ca 442 4023651 0 614.6mb 1437393 true true 7.2.1 false
8 r _h8 620 4023651 0 619.6mb 1436643 true true 7.2.1 false
8 p _h8 620 4023651 0 619.6mb 1436643 true true 7.2.1 false
9 r _4h9 5805 4103379 0 628.8mb 1461421 true true 7.2.1 false
9 r _8g8 10952 4103379 0 628mb 1463400 true true 7.2.1 false
9 p _h4 616 4103699 0 633.9mb 1461606 true true 7.2.1 false
(Segment _8g8 belongs to the largest replica of shard 9)
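The segment listing came from the cat segments API, something like the following (placeholder index name again):
GET /_cat/segments/my-index?v&h=shard,prirep,segment,generation,docs.count,docs.deleted,size,size.memory,committed,searchable,version,compound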
Interestingly enough, the oversized shards each had a segment with a normal doc count, storage usage, and memory footprint. The only strange thing was that the segment size for those shards was much lower than the shard size; it's as if the shard is holding onto a hidden segment. Also interesting is that the segments belonging to the oversized shards were at a much, much higher generation than the other segments, which makes me think these shards got into a bad state where they simply lost track of large chunks of data that should have been deleted.
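One way I'm thinking of double-checking the "hidden segment" theory is to compare what each shard copy reports as live segment data against its on-disk store size, e.g. via the index stats API with the segment file size breakdown (placeholder index name; I haven't run exactly this yet):
GET /my-index/_stats/store,segments?level=shards&include_segment_file_sizes=true
If store.size_in_bytes for a given shard copy is far larger than the sum of its reported segment file sizes, the extra space is files the shard is keeping around outside its live segments.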
Aside from reindexing (which I assume will work), I am interested in discovering a way to get out of this state where these bad shards are slowing down the whole index. I am also interested in learning how we got into this state in the first place.
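For completeness, the reindex fallback I'm referring to would be something along these lines (destination name is a placeholder; I'd create it with the same mappings/settings first):
POST /_reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-v2" }
}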