Dealing with high number of deleted documents

Hello everyone!

We have an Elasticsearch index with 40 shards and 1 replica. We index live email data into this index, so the volume of deletes is also high! We have 2 data nodes and 3 master nodes in our cluster. For some reason, the replicas and primary shards are distributed in such a way that all the replicas are on one data node and all the primary shards are on the other data node.

We found a very interesting pattern when checking the disk usage of these nodes. Node0 (with all the replicas) is using 30 GB more disk than node1 (with all the primaries). We have approximately 1 TB of data in our cluster (primary + replica). We started investigating and we think the discrepancy is due to the number of deleted documents being higher in the replica shards. Around 24% of our total indexed documents are deleted!

We want to reduce the difference in disk space usage. We were thinking of running a force merge on the index, but we are not sure if this is advisable for a 1 TB index. How much time does it take (if anyone has done it in practice) to force merge such an index? Should we make the index read-only (as suggested in the documentation) while running the force merge?
Is there a better way to deal with this?

Here's a set of basic commands that we ran on our ES cluster:
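For reference, _cat requests along these lines show the shard layout, the deleted-document counts, and the per-node disk usage (emails is only a placeholder for our index name):

```bash
# Where each primary (p) and replica (r) shard copy lives, and its size
curl -s 'localhost:9200/_cat/shards/emails?v&h=index,shard,prirep,node,docs,store'

# Index-level totals: docs.deleted vs docs.count gives the ~24% figure
curl -s 'localhost:9200/_cat/indices/emails?v&h=index,docs.count,docs.deleted,pri.store.size,store.size'

# Disk used per data node
curl -s 'localhost:9200/_cat/allocation?v'
```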

This is by design: a replica shard cannot be on the same node that already has its primary shard. If you have just one index, then you will have one node with all primaries and one node with all replicas.

This also seems to be normal behaviour for your use case: a replica shard will have the same documents as the primary shard, but will not necessarily have them distributed across the same segments while indexing.

Since you have a lot of deletes, and deleted documents are only removed when segments are merged, these documents may sit in different segments on primaries and replicas, and those segments will be merged differently.
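You can see this at the segment level with the cat segments API, something like the following (the index name is just an example):

```bash
# docs.deleted per segment, for the primary (p) and replica (r) copies of each shard
curl -s 'localhost:9200/_cat/segments/emails?v&h=shard,prirep,segment,docs.count,docs.deleted,size'
```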

Also, I'm not sure what issue you are trying to solve; what you describe is normal behaviour.

And if you have 1 TB of data, a 30 GB difference in disk usage is just 3% of the total size.

A force merge is only recommended for indices that no longer receive writes, i.e. read-only indices.

While you can set your current index to read-only and do a force merge, I'm not sure this will make much difference, since you will start writing to it again afterwards, and once you delete more data you may end up with the same size difference again.
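If you still want to test it, a minimal sketch of the sequence would look like this (the index name is just an example; block writes first, merge, then re-enable writes):

```bash
# Block writes while the merge runs (reads still work)
curl -s -X PUT 'localhost:9200/emails/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.write": true}'

# Merge each shard down to one segment; this rewrites the data and can take a long time on ~1 TB
curl -s -X POST 'localhost:9200/emails/_forcemerge?max_num_segments=1'

# Re-enable writes afterwards
curl -s -X PUT 'localhost:9200/emails/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.write": false}'
```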

Is this really causing any impact? As mentioned, this is normal behaviour.

Hey @leandrojmp, thanks for the detailed reply

So, out of the 40 primaries and 40 replicas, there cannot be a setup with 20 primary shards and 20 replica shards on each node?

Got it, thanks! We were just trying to understand if there is something we can run periodically (say, a force merge) to clean up the deleted documents! But I understand your point! It is not causing any impact, we just wanted to confirm this is expected behaviour!

No, this goes against the motivation for replica shards.

Replicas are used to add resilience: if a node that has a primary shard goes down, the replica for that shard is promoted to primary and Elasticsearch will try to allocate another replica on a different node.

If you had the primary and replica on the same node and that node went down, you would not have access to your data, so Elasticsearch never allocates a replica on the same node that already holds the primary shard for that replica.

Perhaps consider only_expunge_deletes. From the force merge documentation:

only_expunge_deletes
(Optional, Boolean) If true, expunge all segments containing more than index.merge.policy.expunge_deletes_allowed (default to 10) percents of deleted documents. Defaults to false.

In Lucene, a document is not deleted from a segment; just marked as deleted. During a merge, a new segment is created that does not contain those document deletions.

You can’t specify this parameter and max_num_segments in the same request.

My understanding is that you can run this and then keep writing, as it does not significantly reduce the number of segments... I would test it first.
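For reference, it is a single call against the force merge API (the index name is just an example):

```bash
# Only rewrites segments whose deleted-document ratio exceeds
# index.merge.policy.expunge_deletes_allowed (10% by default)
curl -s -X POST 'localhost:9200/emails/_forcemerge?only_expunge_deletes=true'
```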

Umm, I still need a bit of clarification. Suppose my index has 40 primary shards and 1 replica. If 20 primary shards (p0, ..., p19) and 20 replica shards (r20, ..., r39) are on one data node, and the other primary shards (p20, ..., p39) and replica shards (r0, ..., r19) are on the other data node, what would the issue be?

Even if one node goes down, all the data is still recoverable! I agree that it does not make sense to have r0 and p0 (the primary and replica of the same shard) on a single node.

This can happen, but you do not have much control over where Elasticsearch will allocate the shards.

Elasticsearch may allocate all primaries on the same node, it may allocate half of the primaries on one node and the other half on the other node, or it may choose to allocate 1/4 of your primaries on one node and the other 3/4 on the other node.

What it cannot do is allocate a replica on the same node as its primary shard.

Also, suppose you have half of your primaries on each node and need to restart one of the nodes for an upgrade: the replicas on the remaining node will be promoted to primaries, and you will again end up with one node holding all your primaries. When the other node comes back, Elasticsearch will allocate the replicas on it.


If you have 40 primary shards and 40 replica shards in total across 2 data nodes, a primary and its replica will never reside on the same node. There can, however, be any number of primary shards on one of the nodes. If all primary shards are on one of the nodes, it is possible that the node with only replica shards was the last one to start or has been restarted. This would cause all the shards on the other node to be promoted to primaries. Elasticsearch will manage this as it sees fit, and will not necessarily try to balance the number of primary shards across nodes, as primary and replica shards basically do the same amount of work.

