That's a long story that I may not have time to fully explain today.
Short version:
Some specific changes to the disk space filters or something similar would go a long way. In general I need to identify stale indices (not recently updated) with inefficient shard sizes (say less than 25GB/shard).
I have to keep all of my data until the cluster reaches capacity (or 85% of it). I have dense storage, about 6.4TB of SSD per 128GB of RAM, so I'll exceed reasonable shard counts before I run out of raw space unless I condense inefficient indices.
Currently I wrap curator shrink and reindex jobs in scripts that do extra filtering to identify stale indices and calculate more efficient shard counts based on their data volume.
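To illustrate, the shard-count calculation in my wrapper scripts amounts to something like this (the helper name and logic are my own sketch, not curator's; it relies on the fact that the Elasticsearch shrink API requires the target shard count to be a factor of the source count):

```python
def shrink_target(source_shards, primary_size_gb, min_gb_per_shard=25):
    """Pick a shrink target: the largest factor of source_shards that keeps
    every shard at or above min_gb_per_shard. Returns None when even a
    single shard would be under the minimum (too small to shrink usefully)."""
    if primary_size_gb < min_gb_per_shard:
        return None
    for count in range(source_shards, 0, -1):
        if source_shards % count == 0 and primary_size_gb / count >= min_gb_per_shard:
            return count
```

So a 250GB, 10-shard index stays at 10 shards (already 25GB/shard), a 60GB one shrinks to 2, and a 30GB one shrinks to 1.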
I've tried extensively to manipulate the "space" filter for this purpose but it really doesn't suit this need.
Long version:
I have an application that, for many reasons, generates somewhat unpredictable index names. They all start with a predictable prefix but have an unpredictable suffix based on specific requirements.
They'll all look something like: dataset_001_0_12345678
I cannot change the naming convention at this time; it's tied very deeply to how the application works and to its parent/child relationships. That's all going away in a future version, when the application moves to a flat structure and can use normal rollover features to avoid this whole situation.
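For reference, today I match them with a pattern along these lines (the regex is inferred from the example name above and is an assumption about the suffix format, not a guarantee):

```python
import re

# Inferred from the example "dataset_001_0_12345678": a fixed prefix
# followed by numeric suffix fields. (Assumption; real suffixes may vary.)
INDEX_PATTERN = re.compile(r"^dataset_\d+_\d+_\d+$")

def is_candidate(name):
    """True when the index name follows the application's convention."""
    return bool(INDEX_PATTERN.match(name))
```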
For now, these indices typically start out around 10 primary shards with replicas. Some of them grow to around 250GB (primary size) on average, or 25GB/shard. This is fine. Others may only grow to a couple of GB or less, but still carry 10 shards each. Either way, these indices are created and filled very quickly, and the data within them must be retained for months.
Due to unpredictable data from the application, I end up with thousands of shards overall. Maybe 7 to 9 thousand primary shards currently on one example system across 25 nodes.
This is down from a much higher number, thanks largely to curator jobs that shrink indices of at least ~25GB total from 10 shards down to 1.
But I'm also focused on consolidating lots of smaller indices together, to retain the data while reducing the overall count. I have a requirement to be able to query all of the data. I need the higher shard count for fast writes as the data arrives, but each index can be condensed once it stops updating, even if that reduces read performance for particular chunks of data.
So I'm basically trying to either:
- Shrink individual indices to average at least 25GB per shard.
- Reindex multiple indices to consolidate more shards toward the same goal.
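The consolidation planning could be sketched like this (the greedy strategy and names are mine, not a curator feature; sizes would come from the cat indices API in practice):

```python
def plan_consolidation(index_sizes_gb, target_gb=25):
    """Greedily group small, stale indices so each reindex destination
    ends up holding roughly target_gb or more of primary data.
    index_sizes_gb: dict of index name -> primary size in GB."""
    groups, current, total = [], [], 0.0
    for name, size in sorted(index_sizes_gb.items(), key=lambda kv: kv[1]):
        current.append(name)
        total += size
        if total >= target_gb:
            groups.append(current)
            current, total = [], 0.0
    if current:
        groups.append(current)  # leftover group, may be under target
    return groups
```

Each resulting group becomes one reindex job writing all of its members into a single-shard destination index.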
I also have to detect failures outside of curator, clean up the failed indices, and either retry or move on to another candidate.
For instance, about 1 out of 10 times, curator shrink creates a new index but fails to allocate its shards and exits, leaving that index "red". So we delete and retry. I can open a bug ticket about this later.
For reindex, we compare before-and-after document counts for confidence before deleting the original indices in favor of the newly condensed one. If the reindex action could verify success and auto-delete its inputs, that would also be helpful.
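The verification step boils down to something like this (hypothetical helper; the real counts come from the _count API before and after the reindex):

```python
def post_reindex_action(source_doc_counts, dest_doc_count):
    """Decide what to do after a consolidation reindex: delete the source
    indices only when the destination holds every document, else retry.
    source_doc_counts: dict of source index name -> document count."""
    expected = sum(source_doc_counts.values())
    return "delete_sources" if dest_doc_count == expected else "retry"
```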