Thanks for the reference.
Yes, the GCs can be observed in the logs. For example, during a single day an instance performed ~10k GC runs, in contrast to normal days where only 50-500 GCs are performed:
[2020-04-28T06:25:18,491][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449522] overhead, spent [371ms] collecting in the last [1.3s]
[2020-04-28T06:25:23,492][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449527] overhead, spent [393ms] collecting in the last [1s]
[2020-04-28T06:25:27,561][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449531] overhead, spent [325ms] collecting in the last [1s]
[2020-04-28T06:25:31,900][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449535] overhead, spent [553ms] collecting in the last [1.3s]
[2020-04-28T06:25:36,355][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449539] overhead, spent [704ms] collecting in the last [1.4s]
[2020-04-28T06:25:41,442][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449544] overhead, spent [401ms] collecting in the last [1s]
[2020-04-28T06:25:45,620][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449548] overhead, spent [355ms] collecting in the last [1.1s]
[2020-04-28T06:25:50,623][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449553] overhead, spent [460ms] collecting in the last [1s]
[2020-04-28T06:25:54,626][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449557] overhead, spent [462ms] collecting in the last [1s]
[2020-04-28T06:25:59,630][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449562] overhead, spent [358ms] collecting in the last [1s]
[2020-04-28T06:26:03,634][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449566] overhead, spent [375ms] collecting in the last [1s]
[2020-04-28T06:26:08,637][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449571] overhead, spent [407ms] collecting in the last [1s]
[2020-04-28T06:26:12,640][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449575] overhead, spent [296ms] collecting in the last [1s]
[2020-04-28T06:26:26,707][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449589] overhead, spent [373ms] collecting in the last [1s]
[2020-04-28T06:26:30,710][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449593] overhead, spent [289ms] collecting in the last [1s]
[2020-04-28T06:26:35,713][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449598] overhead, spent [322ms] collecting in the last [1s]
# ...
[2020-04-28T10:40:55,696][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][old][464462][103504] duration [5.3s], collections [1]/[5.5s], total [5.3s]/[9.2h], memory [2.8gb]->[2.3gb]/[2.9gb], all_pools {[young] [819.2mb]->[381.1mb]/[819.2mb]}{[survivor] [97.5mb]->[0b]/[102.3mb]}{[old] [1.9gb]->[1.9gb]/[2gb]}
The above goes on for the whole day; roughly 4-5 GCs are performed every minute.
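In case it's useful, this is a minimal sketch of how the per-day GC counts can be pulled out of the log; the log file name is an assumption, and it simply counts the `JvmGcMonitorService` "overhead" lines per day:

```python
# Count JvmGcMonitorService "overhead" entries per day, to compare a bad day
# against a normal one. The log path is an assumption.
from collections import Counter
import re

per_day = Counter()
with open("elasticsearch.log") as f:
    for line in f:
        if "JvmGcMonitorService" in line and "overhead" in line:
            m = re.match(r"\[(\d{4}-\d{2}-\d{2})T", line)
            if m:
                per_day[m.group(1)] += 1

for day, count in sorted(per_day.items()):
    print(day, count)
```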
After removing the unused indices (and after the load returned to normal), each ES cluster now has the following:
- 1 node
- 27 indices
- 68 shards
- 80,184,574 docs
- 156.96GB
- heap used: 74% (as reported by cerebro)
- process CPU: 1% (as reported by cerebro)
- load: 3.44 (as reported by cerebro)
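For completeness, the cerebro numbers can be cross-checked directly against the Elasticsearch APIs; here's a rough sketch, assuming the node is reachable at localhost:9200 with security disabled:

```python
# Cross-check cluster size and heap usage via the ES APIs instead of cerebro.
# The endpoint (and lack of auth) is an assumption.
import requests

base = "http://localhost:9200"

stats = requests.get(f"{base}/_cluster/stats").json()
print("indices:", stats["indices"]["count"])
print("shards:", stats["indices"]["shards"]["total"])
print("docs:", stats["indices"]["docs"]["count"])
print("store size (bytes):", stats["indices"]["store"]["size_in_bytes"])

jvm = requests.get(f"{base}/_nodes/stats/jvm").json()
for node_id, node in jvm["nodes"].items():
    print(node["name"], "heap used %:", node["jvm"]["mem"]["heap_used_percent"])
```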
The 4 removed indices were worth about 40GB and had up to 6 shards each.
From the post you've linked to, it seems to me that this is a relatively small cluster that would not justify such frequent GCs.
On the other hand, since heap usage is currently at ~75%, maybe it's not unrealistic that before the deletion (with the 4 additional indices) the heap was near 100%, or at least above some threshold that triggers frequent GCs?
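As a back-of-the-envelope check on that hypothesis, the `[gc][old]` line above shows the old generation staying at 1.9gb out of 2gb even after a collection, i.e. the collector keeps firing without reclaiming much. The 75% threshold used below is an assumption based on the CMS initiating occupancy fraction that Elasticsearch ships in its default jvm.options (only relevant if the node actually runs CMS):

```python
# Numbers taken from the [gc][old] log line above:
#   memory [2.8gb]->[2.3gb]/[2.9gb], old pool [1.9gb]->[1.9gb]/[2gb]
old_used_gb, old_max_gb = 1.9, 2.0
heap_used_gb, heap_max_gb = 2.8, 2.9

print(f"old gen occupancy after GC: {old_used_gb / old_max_gb:.0%}")    # ~95%
print(f"heap occupancy before GC:   {heap_used_gb / heap_max_gb:.0%}")  # ~97%

# Assumed CMSInitiatingOccupancyFraction=75 from the default jvm.options.
print("above assumed 75% initiating occupancy:", old_used_gb / old_max_gb > 0.75)
```

If the old generation was pinned above that threshold while the extra ~40GB of indices were still in place, back-to-back collections like the ones in the logs would be the expected behavior.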
Does this make any sense? Thoughts?