I have a legacy Elasticsearch 2.3 index. It is no longer being updated with new entries; the only modifications are deletes. All our new data targets an Elasticsearch 7 cluster, though we do still read from the ES2 index.
The goal for the legacy 2.3 index is to scale it down. One year ago it was made up of ~9.5 billion active documents with ~1 billion deleted documents. Now those numbers have changed to ~8 billion active documents with ~2.5 billion deleted documents.
My concern is that the total number of documents has barely reduced at all, i.e. it has stayed at ~10.5 billion. This is somewhat of a problem as it is delaying our efforts to scale back the 2.3 cluster.
Over the last year only ~47 million documents were purged. ~37 million of which were purged over a ~9 day period.
Reading up on this topic, I became aware of the forcemerge API. Is this my best option? Running it against a local (unfortunately ES7) Docker cluster, I found that it does remove deleted documents when only_expunge_deletes is set to true.
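For reference, the call I tested locally was along these lines (host adjusted to our setup; as far as I can tell the same _forcemerge endpoint exists on 2.x, where it replaced _optimize):

    # ask the merge to rewrite only segments containing deleted documents
    curl -XPOST 'http://localhost:9200/this-legacy-index/_forcemerge?only_expunge_deletes=true'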
But there's obviously a big difference between a local Docker cluster (with the wrong ES version) and a production index that is still being read.
In short, what's my safest option or strategy to tackle this problem and reduce the ES2 disk size?
I'll also include a subset of the _cat/segments API output in case it helps. I'm including 200 lines, but the total output is 20,612 lines. The pattern throughout is fairly consistent, i.e. most segments are ~4.x GB with more than 10% of the segment made up of deleted documents.
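For anyone who wants to reproduce it, a request along these lines produces that kind of output (columns trimmed to the ones I care about):

    # list segments for the legacy index with doc counts, deleted docs and on-disk size
    curl -XGET 'http://localhost:9200/_cat/segments/this-legacy-index?v&h=shard,segment,docs.count,docs.deleted,size'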
I should mention there are indexes other than this-legacy-index on the ES2 cluster. I'm focusing on this-legacy-index as it is by far the largest and oldest. The other indexes have similar segmentation, though there might be slightly fewer 4.x GB segments.
I'd try running a force merge on the index, but you might want to do that during a quiet period, as force merges were not as well managed by the cluster in older versions as they are in newer ones.
A follow-up question. We ran a force merge over the weekend, and it removed most of our deleted data, namely ~2.5 billion documents. That's great.
I can see it also dropped our segment count from ~20k down to ~2k. The downside is that we now have 110 segments that each hold more than 100 GB of data. For performance reasons we've tried to keep our segments below 5 GB.
Any thoughts or recommendations on how we can re-balance the segments?
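The only idea I've had so far is reindexing into a fresh index, since as far as I know segments can't be split in place once written. Assuming the cluster is on a 2.3.x release that has the _reindex API, a rough sketch would be (the destination index name is just an example):

    # copy everything into a new index whose segments will be built from scratch
    curl -XPOST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
      "source": { "index": "this-legacy-index" },
      "dest":   { "index": "this-legacy-index-v2" }
    }'

Given the index still holds ~8 billion documents, I'm not sure that's practical, hence the question.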
I focused on segments as the advice was to keep them below 5 GB, but I'm not sure how strictly that needs to be followed. Monitoring the legacy index after the force merge, I see the search duration has increased from peaks of ~700ms up to peaks of ~1.2s. There's also an increase in heap usage from ~2 GB to ~3.3 GB, but that seems acceptable.
For shards we focus on keeping them below ~50 GB.
The goal for the legacy index is to slowly reduce it, so we've also kicked off a node reduction, i.e. going from ~180 to ~150 nodes. This could also be contributing to the above.
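(For context, a node reduction like this would typically be done by excluding the outgoing nodes via shard allocation filtering and letting their shards drain before shutdown; the node names below are placeholders.)

    # tell the cluster to move shards off the nodes we intend to remove
    curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
      "transient": {
        "cluster.routing.allocation.exclude._name": "legacy-node-151,legacy-node-152"
      }
    }'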
Honestly, upgrading will get you much further.
As a high-level example, not taking a tonne of things into account, 7.X should handle those 2700 shards on 4 nodes. 2.X does not manage shards efficiently.
Before my time, a contractor was hired to give advice on the operation of Elasticsearch. One of the take-away points was to keep the segment size below 5 GB. This point stuck in the "team memory".
While that is not exactly reliable, I noticed Elasticsearch itself, before the force merge, kept segment sizes below 5 GB, e.g. 4.7 GB, 4.8 GB, etc. I also came across blogs like this that mention:
"A maximum sized segment (default: 5 GB) will only be eligible for merging once it accumulates 50% deletions."
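If I've understood correctly, that 5 GB ceiling is the merge policy's index.merge.policy.max_merged_segment setting (default 5gb), and it only constrains automatic merges going forward; it won't split the 100 GB segments the force merge has already produced. Checking, and possibly overriding, it would look roughly like this:

    # inspect any merge policy overrides currently set on the index
    curl -XGET 'http://localhost:9200/this-legacy-index/_settings?pretty'

    # keep future automatic merges from producing segments above 5 GB
    # (I'd confirm this setting is dynamically updatable on 2.3 before relying on it)
    curl -XPUT 'http://localhost:9200/this-legacy-index/_settings' -H 'Content-Type: application/json' -d '{
      "index.merge.policy.max_merged_segment": "5gb"
    }'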
I'd be in favour of migrating the old ES2 data to our ES7 cluster, but it was not my call to make, and the conclusion the team reached is that it'd take too long and be too disruptive.