Looking for advice on how to get rid of all deleted documents in a 4 TB index on a monthly basis. This index is not a time series, and it receives lots of deletes and writes all day long.
These are the document counts from _stats:
"primaries" : {
  "docs" : {
    "count" : 4377351691,
    "deleted" : 1276015486
  },
"total" : {
  "docs" : {
    "count" : 8754708022,
    "deleted" : 2565553924
  },
When I look at the _segments output, I can see some segments with ~40% deleted docs. Here is a sample:
https://gist.githubusercontent.com/ofrivera/a16d67cfa4e3b59c21db1a2b4e8615c7/raw/754e1d46ab468e448dbe40e3b0a3d6208f1706b0/gistfile1.txt
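As a quick sanity check (this is my own sketch, not from any tool), the deleted-doc ratio can be computed from the _stats numbers above; anything north of ~20% index-wide suggests a lot of reclaimable space:

```python
def deleted_ratio(count, deleted):
    """Fraction of docs in the index that are deleted but not yet merged away."""
    return deleted / (count + deleted)

# Primary-shard numbers from the _stats output above.
ratio = deleted_ratio(4377351691, 1276015486)
print(f"primaries deleted ratio: {ratio:.1%}")  # roughly 22.6%
```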
Some options I'm considering:
- Create a copy of the index (via snapshot/restore), make it read-only, expunge deletes, then sync the deltas back.
- _forcemerge, but which options would you suggest, especially to avoid ending up with huge segments?
- Or just be more aggressive with index.merge.scheduler.max_thread_count. Any suggestions?
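For the _forcemerge option, a hedged sketch of the two main variants follows (index name "myindex" and the localhost URL are placeholders). The only_expunge_deletes=true form merges only segments whose deleted-doc ratio exceeds index.merge.policy.expunge_deletes_allowed (10% by default), which avoids collapsing everything into a few huge segments the way max_num_segments=1 would:

```python
def forcemerge_url(index, host="http://localhost:9200",
                   only_expunge_deletes=False, max_num_segments=None):
    """Build a _forcemerge request URL for Elasticsearch 6.x (POST it with curl -XPOST)."""
    url = f"{host}/{index}/_forcemerge"
    params = []
    if only_expunge_deletes:
        # Merge only segments over the expunge_deletes_allowed threshold.
        params.append("only_expunge_deletes=true")
    if max_num_segments is not None:
        # Merge down to at most this many segments per shard (1 = one big segment).
        params.append(f"max_num_segments={max_num_segments}")
    return url + ("?" + "&".join(params) if params else "")

print(forcemerge_url("myindex", only_expunge_deletes=True))
# -> http://localhost:9200/myindex/_forcemerge?only_expunge_deletes=true
```

Note that force merge blocks until it completes, so on a 4 TB index it is worth running it index-by-index during a low-traffic window.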
Some details about the cluster setup:
Specs
AWS i3.4xlarge
16 nodes (all data nodes)
120 GB RAM
16 CPUs
Health
{
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 16,
"number_of_data_nodes" : 16,
"active_primary_shards" : 629,
"active_shards" : 1258,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
OS
CentOS Linux release 7.5.1804 (Core)
3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Nov 30 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
RPM:
elasticsearch-6.4.2.rpm
Details:
elasticsearch-6.4.2-1.noarch
JAVA:
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
Thanks!