Transferring indices marked cold to the cold nodes causes GC overhead and other issues

Hi, I am using ES 5.5 with a hot-cold architecture. When I mark indices cold, the cluster transfers them from the hot nodes to the cold nodes automatically.

The cold nodes use a 30GB JVM heap setting, but I am still seeing the following log entries:

[2017-09-06T15:23:23,583][INFO ][o.e.m.j.JvmGcMonitorService] [_GgVD4J] [gc][old][158124][26565] duration [5.2s], collections [1]/[5.6s], total [5.2s]/[2h], memory [29.8gb]->[28.9gb]/[29.8gb], all_pools {[young] [1.1gb]->[430.9mb]/[1.1gb]}{[survivor] [107.7mb]->[0b]/[149.7mb]}{[old] [28.5gb]->[28.5gb]/[28.5gb]}
[2017-09-06T15:23:36,217][INFO ][o.e.m.j.JvmGcMonitorService] [_GgVD4J] [gc][old][158128][26567] duration [5.2s], collections [1]/[5.6s], total [5.2s]/[2h], memory [29.8gb]->[28.9gb]/[29.8gb], all_pools {[young] [1.1gb]->[388.9mb]/[1.1gb]}{[survivor] [115.3mb]->[0b]/[149.7mb]}{[old] [28.5gb]->[28.5gb]/[28.5gb]}
[2017-09-06T15:23:48,736][INFO ][o.e.m.j.JvmGcMonitorService] [_GgVD4J] [gc][old][158132][26569] duration [5.1s], collections [1]/[5.6s], total [5.1s]/[2h], memory [29.7gb]->[28.9gb]/[29.8gb], all_pools {[young] [1.1gb]->[394.6mb]/[1.1gb]}{[survivor] [100.3mb]->[0b]/[149.7mb]}{[old] [28.5gb]->[28.5gb]/[28.5gb]}
[2017-09-06T15:23:48,737][WARN ][o.e.m.j.JvmGcMonitorService] [_GgVD4J] [gc][158132] overhead, spent [5.1s] collecting in the last [5.6s]
[2017-09-06T15:23:54,639][WARN ][o.e.m.j.JvmGcMonitorService] [_GgVD4J] [gc][158134] overhead, spent [4.4s] collecting in the last [4.9s]
[2017-09-06T15:24:01,120][WARN ][o.e.m.j.JvmGcMonitorService] [_GgVD4J] [gc][158136] overhead, spent [4.6s] collecting in the last [5.4s]
[2017-09-06T15:24:07,129][WARN ][o.e.m.j.JvmGcMonitorService] [_GgVD4J] [gc][158138] overhead, spent [4.2s] collecting in the last [5s]

And now ES has crashed. The monitoring UI shows:

No Monitoring Data Found
No Monitoring data is available for the selected time period. This could be because no data is being sent to the cluster or data was not received during that time.

Try adjusting the time filter controls to a time range where the Monitoring data is expected.

Questions:
1. Is a 30GB JVM heap the right option?
2. Is the recovery speed too fast?

These are my current transient cluster settings:

  "transient": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "none"
        },
        "allocation": {
          "cluster_concurrent_rebalance": "24",
          "enable": "all"
        }
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "240mb"
      }
    },
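Regarding question 2: if the cold nodes cannot keep up, one option is to dial those transient settings back toward the defaults via the cluster settings API. A minimal sketch; the values below are just example assumptions, not figures from this thread:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "2",
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}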

How many shards are we looking at?

68 indices, with 21 shards each because I have 21 hot nodes with SSDs. The store size is about 1.5TB every day (I found one index is huge: above 1TB). I also have 5 cold nodes with 48TB of HDD storage.

I just restarted the cold nodes, and now the Marvel monitoring data is back, but some shards are unassigned. Where are the unassigned shards stored temporarily? Will I lose data because of the restart, given that the replica count is 0?
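To see which shards are unassigned and why, the cat shards and allocation-explain APIs in 5.5 can help; a minimal sketch:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node

GET _cluster/allocation/explain

Called with no body, allocation explain reports on the first unassigned shard it finds and the reasons it is not allocated.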

That's quite a lot.
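(Roughly 68 × 21 = 1,428 primary shards created per day; at about 1.5TB/day that averages out to only around 1GB per shard, which is where the shard-size comment further down comes from.)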

Are you doing the move all at once? Are you shrinking beforehand at all?

What do you mean by shrinking? I move the indices with Curator and force-merge after the move.
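For reference, the force-merge step after the move would typically look something like this; the index name and the single-segment target are assumptions:

POST my_cold_index/_forcemerge?max_num_segments=1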

Is that 68 indices being generated per day, each with 21 primary shards and possibly the same number of replica shards? What is your average and maximum shard size? What is your total retention period?

The indices are generated every day, but the replica count is 0. As I said before, the store size is about 1.5TB every day (I found one index is huge: above 1TB).

Can you tell me what would cause such long GC pauses even though the JVM heap is set to 30GB?

It looks like you are suffering from heap pressure on the cold nodes. How many shards do you have per cold node? How much data do you have on each cold node?
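A quick way to answer both questions is the cat APIs; a minimal sketch:

GET _cat/allocation?v

GET _cat/nodes?v&h=name,node.role,heap.percent,ram.percent

The first shows the shard count and disk usage per node, the second how full each node's heap is.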

It seems like your average shard size is just around 1GB on the hot nodes. This is quite small and will result in more overhead than if you used larger shards. An average shard size around 20GB or 30GB is not uncommon. I would therefore recommend dramatically reducing the number of shards for smaller indices as they probably waste a lot of resources.
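For existing indices, one way to act on this is the shrink API available since 5.0, which is presumably what "shrinking" referred to earlier. A minimal sketch with a hypothetical index name, a hypothetical node name, and a target shard count that must be a factor of the original 21:

PUT my_index/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "cold_node_1",
    "index.blocks.write": true
  }
}

POST my_index/_shrink/my_index_shrunk
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 0
  }
}

The first call pulls a copy of every shard onto one node and makes the index read-only, which shrink requires; for new daily indices it is simpler to just lower index.number_of_shards in the index template.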

Do you mean that moving too-large shards from the hot nodes to the cold nodes will cause long GC pauses or a JVM OOM?

No, I suspect that having too many small shards on the cold nodes is what is causing problems.

Help, help. I think it's the same root cause; now the cluster is dead and the master keeps logging:

[2017-09-08T13:58:14,426][DEBUG][o.e.a.a.i.m.p.TransportPutMappingAction] [BrSJ2NM] failed to put mappings on indices [[[pdc_20170908/37e0NIffQ6-Y_Cr5Bau5qQ]]], type [log]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher$$Lambda$2434/1961004673.accept(Unknown Source) [elasticsearch-5.5.0.jar:5.5.0]
        at java.util.ArrayList.forEach(ArrayList.java:1249) [?:1.8.0_40]
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher$$Lambda$2433/1790943430.run(Unknown Source) [elasticsearch-5.5.0.jar:5.5.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.5.0.jar:5.5.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_40]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_40]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_40]

How can I recover the cluster ASAP?
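The put-mapping timeouts mean cluster-state updates are not being processed within 30s, which is consistent with nodes stuck in long GC pauses. A few read-only checks to see how far behind things are; a sketch, not a fix:

GET _cat/health?v

GET _cat/pending_tasks?v

GET _cat/nodes?v&h=name,master,heap.percent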

Having too many indices and shards is such a common problem that I created a blog post with some guidance and best practices.
