Ways to purge deleted documents in a 4TB index


#1

Looking for advice on how to get rid of all deleted documents in a 4TB index on a monthly basis. This index is not a time series, and it receives lots of deletes and writes all day long.

These are the document counts from _stats:

"primaries" : {
  "docs" : {
    "count" : 4377351691,
    "deleted" : 1276015486
  }
},
"total" : {
  "docs" : {
    "count" : 8754708022,
    "deleted" : 2565553924
  }
}

When I look at the _segments output I can see some segments with 40% deleted documents. Here is a sample:
https://gist.githubusercontent.com/ofrivera/a16d67cfa4e3b59c21db1a2b4e8615c7/raw/754e1d46ab468e448dbe40e3b0a3d6208f1706b0/gistfile1.txt
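
For anyone who wants to find such segments programmatically, here is a minimal Python sketch. It assumes the _segments response has already been parsed into a dict and that it nests as indices, then shards, then a list of shard copies, each with a "segments" map whose entries carry "num_docs" (live documents) and "deleted_docs". The function name, the threshold and the sample data are made up for illustration and are not taken from the gist.

```python
def segments_with_many_deletes(segments_response, threshold=0.4):
    """Return (index, shard, segment, ratio) for segments whose deleted ratio >= threshold."""
    hits = []
    for index_name, index_data in segments_response["indices"].items():
        for shard_id, copies in index_data["shards"].items():
            for copy in copies:  # one entry per primary/replica copy of the shard
                for seg_name, seg in copy["segments"].items():
                    total = seg["num_docs"] + seg["deleted_docs"]
                    if total == 0:
                        continue  # skip empty segments to avoid dividing by zero
                    ratio = seg["deleted_docs"] / total
                    if ratio >= threshold:
                        hits.append((index_name, shard_id, seg_name, round(ratio, 3)))
    return hits


# Hypothetical, hand-written sample (not real cluster output).
sample = {
    "indices": {
        "my_index": {
            "shards": {
                "0": [
                    {
                        "segments": {
                            "_a": {"num_docs": 600, "deleted_docs": 400},
                            "_b": {"num_docs": 900, "deleted_docs": 100},
                        }
                    }
                ]
            }
        }
    }
}

print(segments_with_many_deletes(sample))
```

With the made-up sample, only segment "_a" (400 deleted out of 1000) reaches the 40% threshold.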

Some options I'm considering:

  • Create a copy of the index (via snapshot/restore), make it read-only, expunge deletes, then sync the deltas.
  • Run _forcemerge, but which options do you suggest, especially to avoid ending up with huge segments?
  • Or just be more aggressive with index.merge.scheduler.max_thread_count. Any suggestions?

Some details about cluster setup:
Specs

AWS i3.4xlarge
16 nodes (all data nodes)
120 GB RAM
16 CPUs

Health

{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 16,
  "number_of_data_nodes" : 16,
  "active_primary_shards" : 629,
  "active_shards" : 1258,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

OS
CentOS Linux release 7.5.1804 (Core)
3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Nov 30 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

RPM:
elasticsearch-6.4.2.rpm
Details:
elasticsearch-6.4.2-1.noarch

JAVA:
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

Thanks!


(David Turner) #2

The total proportion of deletes in your index is ~25% which seems healthy for an index that is under load.
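
As a quick check of that figure, here is a small Python sketch that computes the deleted proportion from the _stats numbers posted above. It assumes "count" holds live documents only, so the denominator is count plus deleted; the function name is made up for illustration.

```python
def deleted_fraction(live_docs, deleted_docs):
    """Fraction of all documents (live + deleted) that are marked deleted."""
    return deleted_docs / (live_docs + deleted_docs)


# Numbers copied from the _stats output in post #1.
primaries = deleted_fraction(4377351691, 1276015486)
total = deleted_fraction(8754708022, 2565553924)

print(f"primaries: {primaries:.1%}, total: {total:.1%}")
```

Under that assumption both values come out a little under 23%, in line with the rough estimate above.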

Importantly, why do you want to do this? The answer depends on the goal you're trying to achieve.


#3

We have noticed that cluster performance improves when we keep the percentage of deleted documents low; in particular, the number of slow queries and queued tasks decreases. In the past we have used two approaches, _forcemerge?only_expunge_deletes=true and reindexing, but as the index continues to grow this becomes more challenging.
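
For reference, the expunge call mentioned here can be sketched in Python as follows. This only builds the HTTP method and URL and does not send anything; the host and index name are placeholders, and the helper name is made up.

```python
from urllib.parse import quote, urlencode


def expunge_deletes_request(base_url, index):
    """Build (method, url) for a force merge that only expunges deletes."""
    query = urlencode({"only_expunge_deletes": "true"})
    return "POST", f"{base_url}/{quote(index)}/_forcemerge?{query}"


# Placeholder host and index name, for illustration only.
method, url = expunge_deletes_request("http://localhost:9200", "my_index")
print(method, url)
```

The request can then be sent with whatever HTTP client is already in use.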


(Nik Everett) #4

forcemerge can "upset" the algorithm that selects which segments to merge, so it tends to cause more trouble in the long run unless the data set isn't changing. I believe some folks have done some recent work on making it not quite so upsetting to the algorithm, but I don't have a handy link and couldn't find it when I looked. Sorry!


(David Turner) #5

@nik9000 possibly you are thinking of https://issues.apache.org/jira/browse/LUCENE-7976 and/or https://github.com/elastic/elasticsearch/issues/32323?

I'm no expert in this area, but _forcemerge today seems like a bad idea because it will result in unnaturally large segments that are correspondingly harder to purge of deletes. It seems that changes are in the works so this may not be true in future.

Changing a thread count might help if your system were falling behind with its merges, but since you have a reasonable proportion of deletes it seems plausible that it's exactly where it wants to be.
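
To make the thread-count option concrete, here is a hedged Python sketch. It assumes the default for index.merge.scheduler.max_thread_count is max(1, min(4, processors / 2)), as I recall it from the merge scheduler documentation, and that the setting can be changed through the index _settings API; please verify both against the docs for 6.4 before relying on them. The function names are made up.

```python
import json


def default_max_merge_threads(processors):
    """Assumed default: max(1, min(4, processors / 2)), using integer division."""
    return max(1, min(4, processors // 2))


def merge_threads_settings_body(thread_count):
    """JSON body for PUT /<index>/_settings that sets the merge thread count."""
    return json.dumps(
        {"index": {"merge": {"scheduler": {"max_thread_count": thread_count}}}}
    )


print(default_max_merge_threads(16))  # the nodes in this thread have 16 CPUs
body = merge_threads_settings_body(4)
print(body)
```

If the assumed formula is right, 16-CPU nodes already sit at the cap of 4 merge threads by default.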

That leaves the reindex idea.
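
A reindex copies only live documents, so the destination index starts without deletes. A minimal Python sketch of the request follows; it only builds the method, URL and body without sending them. The host and index names are placeholders, the helper name is made up, and wait_for_completion=false is my assumption of a sensible choice for a copy this large, since it runs the reindex as a background task.

```python
import json


def reindex_request(base_url, source_index, dest_index):
    """Build (method, url, body) for a _reindex call run as a background task."""
    url = f"{base_url}/_reindex?wait_for_completion=false"
    body = json.dumps(
        {"source": {"index": source_index}, "dest": {"index": dest_index}}
    )
    return "POST", url, body


# Placeholder names, for illustration only.
method, url, body = reindex_request("http://localhost:9200", "my_index", "my_index_v2")
print(method, url)
print(body)
```

Writes and deletes that arrive during the copy still need to be applied to the new index separately, which is the "sync deltas" step from the first post.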