GC takes a lot of time

We use Elasticsearch to store log data.
Currently we have 16 data nodes and 3 master nodes.
Each data node has a 24-core CPU, 12 x 2.6T disks, and 32G of memory.
We give about half of the memory to Elasticsearch.
We use index names like bussness-2018.01.01. When the shard count grew to 27000, GC started behaving like this:

    [2018-01-08T13:32:13,437][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295560] overhead, spent [15.4s] collecting in the last [16.2s]
    [2018-01-08T13:32:53,044][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295585][122299] duration [15.3s], collections [2]/[15.4s], total [15.3s]/[12.2h], memory [13.7gb]->[12.4gb]/[13.8gb], all_pools {[young] [1.1gb]->[2.4mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [12.4gb]->[12.4gb]/[12.5gb]}
    [2018-01-08T13:32:53,044][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295585] overhead, spent [15.3s] collecting in the last [15.4s]
    [2018-01-08T13:33:16,008][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295593][122300] duration [14.4s], collections [1]/[15.9s], total [14.4s]/[12.2h], memory [13.6gb]->[12.7gb]/[13.8gb], all_pools {[young] [1gb]->[253.4mb]/[1.1gb]}{[survivor] [131.8mb]->[0b]/[149.7mb]}{[old] [12.4gb]->[12.4gb]/[12.5gb]}
    [2018-01-08T13:33:16,023][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295593] overhead, spent [14.9s] collecting in the last [15.9s]
    [2018-01-08T13:33:33,060][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295596][122301] duration [14.7s], collections [1]/[15s], total [14.7s]/[12.2h], memory [13.4gb]->[12.2gb]/[13.8gb], all_pools {[young] [859.2mb]->[2.5mb]/[1.1gb]}{[survivor] [113.3mb]->[0b]/[149.7mb]}{[old] [12.5gb]->[12.1gb]/[12.5gb]}
    [2018-01-08T13:33:33,060][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295596] overhead, spent [14.7s] collecting in the last [15s]
    [2018-01-08T13:35:32,803][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295700][122316] duration [15.8s], collections [1]/[16.2s], total [15.8s]/[12.2h], memory [13.4gb]->[12.5gb]/[13.8gb], all_pools {[young] [985.1mb]->[163.6mb]/[1.1gb]}{[survivor] [125mb]->[0b]/[149.7mb]}{[old] [12.4gb]->[12.4gb]/[12.5gb]}
    [2018-01-08T13:35:32,804][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][1295700] overhead, spent [15.8s] collecting in the last [16.2s]
    [2018-01-08T13:35:49,828][WARN ][o.e.m.j.JvmGcMonitorService] [es-data-m-2] [gc][old][1295703][122317] duration [14.7s], collections [1]/[15s], total [14.7s]/[12.2h], memory [13gb]->[12.1gb]/[13.8gb], all_pools {[young] [448.8mb]->[27mb]/[1.1gb]}{[survivor] [116.6mb]->[0b]/[149.7mb]}{[old] [12.4gb]->[12.1gb]/[12.5gb]}

Then we deleted some indices to bring it down to 17000, and everything was OK.
But two days later, the same problem occurred.
We had to merge some small indices into bigger ones, bringing the number of shards down to 9000.
It went well for about 3 days, then the GC problems occurred again.
We deleted some indices to get it to 7000, and now it's OK.
How should we improve the GC performance? We can't keep deleting indices.
And this is the JVM of a freshly restarted data node.

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "myesdb",
  "nodes" : {
    "pw4f0oWfT9KGSGi84AKm4w" : {
      "timestamp" : 1515421426980,
      "name" : "es-data-m-9",
      "transport_address" : "10.83.56.10:9300",
      "host" : "10.83.56.10",
      "ip" : "10.83.56.10:9300",
      "roles" : [
        "data",
        "ingest"
      ],
      "jvm" : {
        "timestamp" : 1515421426981,
        "uptime_in_millis" : 11791868,
        "mem" : {
          "heap_used_in_bytes" : 10351075064,
          "heap_used_percent" : 69,
          "heap_committed_in_bytes" : 14875361280,
          "heap_max_in_bytes" : 14875361280,
          "non_heap_used_in_bytes" : 145224888,
          "non_heap_committed_in_bytes" : 152145920,
          "pools" : {
            "young" : {
              "used_in_bytes" : 988229576,
              "max_in_bytes" : 1256259584,
              "peak_used_in_bytes" : 1256259584,
              "peak_max_in_bytes" : 1256259584
            },
            "survivor" : {
              "used_in_bytes" : 154718320,
              "max_in_bytes" : 157024256,
              "peak_used_in_bytes" : 157024256,
              "peak_max_in_bytes" : 157024256
            },
            "old" : {
              "used_in_bytes" : 9208127168,
              "max_in_bytes" : 13462077440,
              "peak_used_in_bytes" : 13088728336,
              "peak_max_in_bytes" : 13462077440
            }
          }
        },
        "threads" : {
          "count" : 272,
          "peak_count" : 314
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 2845,
              "collection_time_in_millis" : 168440
            },
            "old" : {
              "collection_count" : 981,
              "collection_time_in_millis" : 202767
            }
          }
        },
        "buffer_pools" : {
          "direct" : {
            "count" : 274,
            "used_in_bytes" : 815798639,
            "total_capacity_in_bytes" : 815798638
          },
          "mapped" : {
            "count" : 22171,
            "used_in_bytes" : 1962403489862,
            "total_capacity_in_bytes" : 1962403489862
          }
        },
        "classes" : {
          "current_loaded_count" : 12300,
          "total_loaded_count" : 12511,
          "total_unloaded_count" : 211
        }
      }
    }
  }
}
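
(Output like the above comes from the JVM section of the nodes stats API; the HTTP port here is an assumption:)

    # JVM stats for a single node
    curl -s 'http://10.83.56.10:9200/_nodes/es-data-m-9/stats/jvm?pretty'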

27K shards sounds like a lot of shards (almost 1.7K per data node). Maybe you have too many shards given the size of your data.

  1. How many indices do you have?
  2. How much data are you storing in each of your daily indices?

Have a look at this blog post for some general guidelines.
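
To answer those, the _cat APIs give a quick overview (host and port here are placeholders):

    # one line per index: primary/replica counts, doc count, store size
    curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'

    # one line per shard, handy for spotting shards that are far too small or far too large
    curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,store'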


Now we have 7000 shards across about 368 indices. We store about 2.5T of data every day, and we use ES 5.5.1.

What is your average shard size? How much heap have you got configured for each data node? Are you forcemerging indices no longer being written to (assuming you have time-based indices) in order to reduce the number of segments?
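
For reference, a force merge of an index that is no longer being written to looks roughly like this (the date in the index name is just an example; only run it on indices that no longer receive writes):

    # merge each shard of an old daily index down to a single segment
    curl -s -XPOST 'http://localhost:9200/bussness-2018.01.01/_forcemerge?max_num_segments=1'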

The shard sizes vary from 2G to 400G, depending on how much log data each service produces; most of them are less than 100G. I use the reindex API to merge the small indices into a big one and then delete the small ones. I start Elasticsearch with -Xms14G -Xmx14G.
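
That reindex step is roughly the following (the source pattern and destination name are made up for illustration):

    # copy several small daily indices into one consolidated index;
    # the small source indices can be deleted afterwards
    curl -s -XPOST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
    {
      "source": { "index": "smallservice-2018.01.*" },
      "dest":   { "index": "smallservice-2018.01" }
    }'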

If you are using time-based indices, you might want to look into using the rollover index API to create new indices only when they reach a particular age and/or size. This can save you from having to reindex a lot of data and still allows you to handle peaky traffic without certain indices/shards getting too large. This is also described in this blog post.
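
A minimal sketch of rollover with the 5.x conditions (alias and index names are made up; on 5.5 the supported conditions are max_age and max_docs):

    # bootstrap: create the first index behind a write alias
    curl -s -XPUT 'http://localhost:9200/logs-bussness-000001' -H 'Content-Type: application/json' -d '
    {
      "aliases": { "logs-bussness-write": {} }
    }'

    # call this periodically (e.g. from cron): a new index is only created
    # when a condition is met, and the alias then points to the new index
    curl -s -XPOST 'http://localhost:9200/logs-bussness-write/_rollover' -H 'Content-Type: application/json' -d '
    {
      "conditions": {
        "max_age": "1d",
        "max_docs": 500000000
      }
    }'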

100G per shard sounds way too big. A size between 20GB and 50GB per shard is much more reasonable. So you have a mix of shards that are too small (lots of 2GB shards) and shards that are too big (lots of 100GB+ shards).

Indices with lots of data should probably be configured with more primary shards, while indices receiving less data should be configured with fewer primary shards (see the template sketch below).
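
A sketch using the 5.x index template syntax (template names, patterns and shard counts are only examples to adapt per service):

    # high-volume services: more primary shards per daily index
    curl -s -XPUT 'http://localhost:9200/_template/big_services' -H 'Content-Type: application/json' -d '
    {
      "template": "bigservice-*",
      "settings": { "number_of_shards": 6, "number_of_replicas": 1 }
    }'

    # low-volume services: a single primary shard is enough
    curl -s -XPUT 'http://localhost:9200/_template/small_services' -H 'Content-Type: application/json' -d '
    {
      "template": "smallservice-*",
      "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
    }'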

Also look into the rollover index API like Christian suggested.

Thanks a lot, that really helps. I will do some research on these articles. One more question: we use Filebeat to write data directly to ES. Does Filebeat support the rollover index API?

The rollover index API is something you set up and call on the ES side directly: whenever an index grows too big (or gets too old), it gets rolled over. Filebeat has no notion of that feature at all; it is completely transparent to it.
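
In practice that means Filebeat just keeps indexing into the same alias name; rollover only changes which concrete index sits behind it. For example (alias name and document type are made up):

    # which concrete index is currently behind the write alias?
    curl -s 'http://localhost:9200/_cat/aliases/logs-bussness-write?v'

    # writes through the alias always land in that current index
    curl -s -XPOST 'http://localhost:9200/logs-bussness-write/log' -H 'Content-Type: application/json' -d '
    { "@timestamp": "2018-01-08T13:32:13Z", "message": "example log line" }'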

Oh, I see, I just looked at the API. Thanks a lot, I will close the issue for now. Again, thank you very much, guys.

If you are not already, you can use curator to manage the rollover process.

It looks like you are using ingest nodes. If so, I would generally also recommend using dedicated nodes for this.

OK, cool, thanks again. I will try Curator and set up dedicated ingest nodes too. So happy to now know what to do. :rofl:
