As mentioned in the topic title, I call the force merge API to manually clean up deleted docs in an index. Is it possible and safe to update documents or add new documents to an index while it is merging?
Yes, but we generally advise reserving the pain and cost of "force merge" operations for indices where you don't expect to receive any more updates. Automatic merging happens naturally in the background as new docs are added or updated. Your "helping hand" is not typically required with merges unless you know something Elasticsearch doesn't, e.g. you know there's no more content to be added to an index.
I have a follow-up question. What conditions trigger automatic merging? (Perhaps I have to study Lucene's mechanism?) I am concerned because my index has 6.7m docs.count and 6.5m docs.deleted, and I hope those deleted docs can be cleaned up to free up storage.
Mike McCandless has a great blog post on the Lucene internals.
Ordinarily, segments with >50% deleted docs are targets for rewriting through merge operations.
My guess is you've previously called force merge to create a single large segment and then done a bunch of deletes. This is where things get tricky because there's another limit in play: the maximum segment size Lucene is allowed to create. Ordinarily, "natural" writes and merges keep segments below this threshold, but a force merge operation can create an uber-segment that exceeds this size and essentially becomes unmanaged from that point on. Despite having >50% deletes, it is seen as too large and costly to rewrite because the maximum segment size threshold has been exceeded.
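To make the interaction between the two limits concrete, here is a minimal Python sketch of the rule as described in this post. It is a simplification for illustration only, not Lucene's actual merge policy code. The names `Segment` and `is_merge_candidate` are hypothetical, and the 5 GB maximum segment size is an assumed value that you should check against the merge settings of your own version.

```python
from dataclasses import dataclass

# Assumed values for illustration; check the settings of your own cluster.
MAX_SEGMENT_BYTES = 5 * 1024**3   # assumed maximum segment size (5 GB)
DELETE_THRESHOLD = 0.50           # ">50% deleted docs", as stated in the post


@dataclass
class Segment:
    live_docs: int
    deleted_docs: int
    size_bytes: int

    @property
    def deleted_fraction(self) -> float:
        total = self.live_docs + self.deleted_docs
        return self.deleted_docs / total if total else 0.0


def is_merge_candidate(seg: Segment) -> bool:
    """Simplified rule: heavily deleted AND not larger than the size limit."""
    if seg.size_bytes > MAX_SEGMENT_BYTES:
        # An oversized segment (e.g. one produced by force merge) is skipped,
        # no matter how many deletes it carries.
        return False
    return seg.deleted_fraction > DELETE_THRESHOLD


# Hypothetical segments: one of ordinary size, one oversized.
normal = Segment(live_docs=400_000, deleted_docs=600_000, size_bytes=2 * 1024**3)
oversized = Segment(live_docs=200_000, deleted_docs=6_500_000, size_bytes=40 * 1024**3)

print(is_merge_candidate(normal))     # True
print(is_merge_candidate(oversized))  # False
```

The second segment is almost entirely deletes, yet under this simplified rule it is never picked up, which is the "unmanaged" situation described above.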
This serves as a good case in point for user intervention conflicting with the objectives of internal algorithms, and this is why we've progressively tried to remove user-facing controls over merge operations. It's like allowing two people to try to drive the same car.
It's worth asking whether your deletes are part of a data retention policy that ages out content, e.g. deleting all data from the previous month. If this is the case, you should look at using multiple time-based indices rather than maintaining a single index. It is cheaper and quicker to drop a whole index than to delete individual docs from an existing index.
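As a rough illustration of the time-based approach, the Python sketch below routes documents to monthly indices and works out which whole indices have aged out. The `prefix-YYYY.MM` naming scheme, the function names, and the retention rule (keep the current month plus `keep_months` earlier months) are all assumptions made for this example, not something prescribed by Elasticsearch.

```python
from datetime import date


def month_index_name(prefix: str, day: date) -> str:
    """Name of the monthly index a document dated `day` would be written to."""
    return f"{prefix}-{day.year:04d}.{day.month:02d}"


def expired_indices(names, prefix, today, keep_months):
    """Return index names whose month falls outside the retention window."""
    # Convert year/month to a single month counter so that subtraction
    # works across year boundaries.
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    expired = []
    for name in names:
        year, month = name[len(prefix) + 1:].split(".")
        if int(year) * 12 + (int(month) - 1) < cutoff:
            expired.append(name)
    return expired


names = ["events-2024.01", "events-2024.02", "events-2024.03", "events-2024.04"]
print(month_index_name("events", date(2024, 4, 15)))                    # events-2024.04
print(expired_indices(names, "events", date(2024, 4, 15), keep_months=2))  # ['events-2024.01']
```

Each name returned can then be removed with a single delete-index request, which discards the whole index at once instead of marking individual docs as deleted and waiting for merges to reclaim the space.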
I run a number of document update operations on that index. As I understand it, calling update actually deletes the current document and creates a new one with the updated values, so frequent updates lead to my current issue.
In my production ES, I do have time-based indices, which are deleted directly once they are 'outdated'. But some important data is kept in a 'permanent' index. For example, to keep the purchase status of every customer, my approach is to update the 'last_purchase_date' and 'total_purchase' fields on doc_id: customerA. I cannot put this data into time-based indices because I have to keep a full list of all customers and increment the total purchase value based on the old document. It would be nice if there were better methods for my case. Thanks.
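For reference, the kind of update described here could be expressed as a scripted update, sketched below in Python as a function that only builds the request body. The field names come from the post; the function name `purchase_update_body` is hypothetical, and the body layout assumes the update API's `script` and `upsert` sections with a Painless script. Note that a scripted update still rewrites the whole document internally, so it accumulates deleted docs in the same way.

```python
def purchase_update_body(amount: float, purchase_date: str) -> dict:
    """Build a request body for a scripted update that increments a running total."""
    return {
        "script": {
            "lang": "painless",
            "source": (
                "ctx._source.total_purchase += params.amount; "
                "ctx._source.last_purchase_date = params.date"
            ),
            # Passing values as params keeps the script text constant.
            "params": {"amount": amount, "date": purchase_date},
        },
        # Used only when the customer document does not exist yet.
        "upsert": {"total_purchase": amount, "last_purchase_date": purchase_date},
    }


# Hypothetical purchase for illustration.
body = purchase_update_body(19.99, "2024-04-15")
print(body["script"]["params"])
```

The body would be sent to the update endpoint for the customer's document ID, so the increment happens server-side rather than by reading the old document, modifying it in the client, and writing it back.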
Ok. So 6.7m docs minus 6.5m deletes = 0.2m customers?
As I mentioned before it sounds like the use of "force_merge" has left you with an oversized segment that is not being tidied automatically.
Customers are one case. There are other types of documents that use the same update approach to store permanent data.
As far as I remember, I did not force merge the index. But I did reindex from index A to index B, where index A was an old index that had been force merged before and index B is my 6.7m-docs index. I guess that is unlikely to be the reason. Are there any related metrics or settings I can provide to help look into the issue?
We can deep-dive into the segments in your index using:
Please check the link for the outputs. They exceed the word limit, so I had to use Google Drive.
My mistake. I assumed you had 6.7m docs of which 6.5m docs were deleted and this was mostly in one segment.
Things aren't nearly as bad as I thought, and the index is behaving normally. You currently have 37% of your space given to deleted docs, likely caused by frequent updates. All segments are hovering under the 50% deleted threshold which would trigger their merging.
Here are some breakdowns of your results where each bar is a segment:
Sorry for the late reply. I found that I had misinterpreted the cat indices API in ES. The docs.count and docs.deleted columns are mutually exclusive.
Thank you for your illustration showing that the index is normal. I also saw the deleted docs value drop last week.
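Since the two columns are counted separately, the share of deleted docs has to be computed against their sum rather than by subtracting one from the other. A minimal Python sketch using the figures quoted earlier in the thread (the function name `deleted_fraction` is hypothetical):

```python
def deleted_fraction(docs_count: int, docs_deleted: int) -> float:
    """Share of all docs held in segments that are marked deleted.

    docs_count is the number of live docs; docs_deleted is counted separately.
    """
    total = docs_count + docs_deleted
    return docs_deleted / total if total else 0.0


# Figures quoted earlier in the thread.
fraction = deleted_fraction(6_700_000, 6_500_000)
print(f"{fraction:.1%}")  # 49.2%
```

This ratio is by document count, not by disk space, so it need not match a percentage measured in bytes.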