Logstash 7.5 GCs so frequently that it can barely do any work

Thank you for reading my problem, and sorry for bothering you!!!

1. Env

  1. logstash version 7.5.0
  2. java version
java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

  3. logstash.yml

pipeline.workers: 6
pipeline.batch.size: 50

  4. jvm.options
-Xms6g
-Xmx6g

2. Problem

I use Logstash to consume logs from Kafka, run them through grok filters, and then write them to an Elasticsearch cluster.
But I find that my Logstash process sometimes goes into very frequent GC, and the GC consumes so much of the time that Logstash does almost no work.
The gc.log is full of entries like this:


2021-11-27T17:36:21.048+0800: 78068.514: [CMS-concurrent-mark-start]
2021-11-27T17:36:23.878+0800: 78071.344: [Full GC (Allocation Failure) 2021-11-27T17:36:23.878+0800: 78071.344: [CMS2021-11-27T17:36:25.512+0800: 78072.978: [CMS-concurrent-mark: 4.462/4.464 secs] [Times: user=39.30 sys=0.82, real=4.47 secs]
 (concurrent mode failure): 4194303K->4194303K(4194304K), 35.7906436 secs] 6081790K->4204353K(6081792K), [Metaspace: 103267K->103267K(1155072K)], 35.7908146 secs] [Times: user=44.59 sys=0.67, real=35.79 secs]
2021-11-27T17:36:59.669+0800: 78107.135: Total time for which application threads were stopped: 35.7917723 seconds, Stopping threads took: 0.0000556 seconds
2021-11-27T17:36:59.673+0800: 78107.139: Total time for which application threads were stopped: 0.0027825 seconds, Stopping threads took: 0.0001835 seconds
2021-11-27T17:37:00.674+0800: 78108.140: Total time for which application threads were stopped: 0.0007674 seconds, Stopping threads took: 0.0000516 seconds
2021-11-27T17:37:01.669+0800: 78109.135: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194303K(4194304K)] 4310321K(6081792K), 0.0121234 secs] [Times: user=0.10 sys=0.00, real=0.02 secs]
2021-11-27T17:37:01.681+0800: 78109.147: Total time for which application threads were stopped: 0.0129999 seconds, Stopping threads took: 0.0000897 seconds
2021-11-27T17:37:01.681+0800: 78109.147: [CMS-concurrent-mark-start]
2021-11-27T17:37:06.359+0800: 78113.825: [CMS-concurrent-mark: 4.678/4.678 secs] [Times: user=43.90 sys=1.14, real=4.67 secs]
2021-11-27T17:37:06.359+0800: 78113.825: [CMS-concurrent-preclean-start]
2021-11-27T17:37:06.569+0800: 78114.035: [Full GC (Allocation Failure) 2021-11-27T17:37:06.569+0800: 78114.035: [CMS2021-11-27T17:37:12.639+0800: 78120.106: [CMS-concurrent-preclean: 6.278/6.280 secs] [Times: user=7.09 sys=0.17, real=6.28 secs]
 (concurrent mode failure): 4194303K->4194303K(4194304K), 40.9531845 secs] 6081787K->4205126K(6081792K), [Metaspace: 103268K->103268K(1155072K)], 40.9533777 secs] [Times: user=39.95 sys=0.84, real=40.96 secs]
2021-11-27T17:37:47.523+0800: 78154.989: Total time for which application threads were stopped: 40.9546434 seconds, Stopping threads took: 0.0000614 seconds

I also used jmap -histo:live 32409 to look at the heap; the top entries are like this:


 num     #instances         #bytes  class name
----------------------------------------------
   1:      17289297      691571880  org.jruby.util.ByteList
   2:      17245027      689801080  org.jruby.RubyString
   3:      17417048      659620560  [C
   4:      17334568      554706176  java.util.concurrent.ConcurrentHashMap$Node
   5:      17214558      550865856  org.logstash.FieldReference
   6:      17296590      514218192  [B
   7:      17416092      417986208  java.lang.String
   8:          8271      135421720  [Ljava.util.concurrent.ConcurrentHashMap$Node;
   9:        139671        4469472  java.util.HashMap$Node
  10:         74368        4338232  [Ljava.lang.Object;

I want to know how I can avoid this problem. Sorry for bothering you.

3. Logstash settings

Pipeline config file


input {

  kafka{

        bootstrap_servers => "xxx:9092"
        topics_pattern => "log_.*"
        auto_offset_reset => "latest"
        group_id => "log-es-find"
        consumer_threads => 1
        client_id => "node123"
        partition_assignment_strategy  => "org.apache.kafka.clients.consumer.RoundRobinAssignor"
        max_poll_interval_ms => "600000"
        max_poll_records => "100"

        codec => json {
            charset => "UTF-8"
        }
  }

}





filter {

    mutate {
        remove_field => [ "host","@timestamp" ]
      }

    if "_LOGV3_" not in  [message] {
          grok {
              match => { 
                  "message" => '^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{STRING:process} %{STRING:thread} - %{SESSION_ID:session_id} %{GREEDYDATA:message_log}$' 
              }
              remove_field => ["message"]
          }

    } else {

          grok {
                match => { "message" => '^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{STRING:process} %{STRING:thread} _LOGV3_ %{SESSION_ID:session_id} %{EVENT_MARK:event} %{GREEDYDATA:message_log}$' 
                }

          }

          json {

              source => "message_log"
              target => "doc.custom"
              remove_field => ["message","message_log"]

          }

          mutate {
                rename => { "doc.custom" => "doc.%{event}" }
          }

    }

   if "_groktimeout" in [tags] {

      grok {

         match => {"message" => "^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type}"}
         remove_tag => [ "_groktimeout" ]
         add_tag => [ "timeout" ]
      }
  }

  if "_grokparsefailure" in [tags] {

      grok {
          remove_tag => [ "_grokparsefailure" ]
          add_tag => [ "grok_failure" ]
          match => { "message" => '^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{GREEDYDATA:message_log}$' }
      }
   }

  date {
        match => ["time",  "ISO8601"]
        target => "@timestamp"
        locale => "cn"
  }


  mutate {
          remove_field => ["time", "@metadata","host","input", "agent","ecs","log" ]
  }

  if "_grokparsefailure" in [tags] {

      mutate {
          add_field => {
            "service_type" => "grok_failure"
          }
      }
   }

   if "_groktimeout" in [tags] {
        mutate {
              add_field => {
                  "service_type" => "grok_time_out"
              }
        }
   }

}



output {

 elasticsearch {

    hosts => ["http://es:9200"]
    index => "mai_log_%{[service_type]}_detail_%{+YYYY.MM.dd}"
    user => "elastic"
    password => "xxx"

  }
}

I can make sure the bottleneck is not Elasticsearch, because I have several Logstash instances and the others are working fine at the moment (although they may run into the same problem at other times).

Thank you for reading all of this, and sorry again for bothering you!!!

That feels a lot like a bug. Your jmap shows a lot of memory used by org.logstash.FieldReference. Does it seem remotely possible that your JSON could have 17 million different field names? I would expect that to exceed field limits in Elasticsearch.

That said, when I run this configuration

input { generator { count => 3 lines => [ '' ] } }
filter {
    ruby { init => '@n = 0' code => '@n += 1; event.set("[@metadata][json]", "{ \"field#{@n}\": 0 }")' }
    json { source => "[@metadata][json]" }
}
output { stdout { codec => dots } }

then once logstash has started (with configuration reloading enabled) and the pipeline terminated I see

tenured generation total 194360K, used 116616K [0x00000000e5ea0000, 0x00000000f1c6e000, 0x0000000100000000)

That is the baseline to get logstash up and running. If I change 3 to 300000 then that changes to

tenured generation total 217760K, used 206162K [0x00000000e5ea0000, 0x00000000f3348000, 0x0000000100000000)

So it has grown by about 80 MB. If I change 300000 to 300001 that changes to

tenured generation total 335940K, used 202347K [0x00000000e5ea0000, 0x00000000fa6b1000, 0x0000000100000000)

so re-using the field names in the same process does not significantly grow the tenured usage. If I change 300001 to 600001 then I get

tenured generation total 340276K, used 277891K [0x00000000e5ea0000, 0x00000000faaed000, 0x0000000100000000)

so again around 80 MB more tenured usage for 300000 new field names. I don't know why logstash wants to remember that it has, in the past, seen a field name, but it feels like a performance optimization.
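
As a rough sanity check against your numbers: 80 MB for 300,000 new field names works out to roughly 270 bytes per name, and your jmap histogram shows about 17 million FieldReference instances. At that rate they would account for something on the order of 4 to 5 GB, which is consistent with the byte counts at the top of the histogram adding up to about 4 GB, and with the completely full 4 GB old generation in your GC log.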

If I change the count to 500000 then I am able to get a heap dump, and MAT (the Eclipse Memory Analyzer) tells me that over half the heap is used by leak suspect #1....

The class org.logstash.FieldReference , loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xe5ea4330 , occupies 121,394,832 (50.39%) bytes. The memory is accumulated in one instance of java.util.concurrent.ConcurrentHashMap$Node[] , loaded by , which occupies 119,199,664 (49.48%) bytes.


And after spending a couple of hours finding all that out, I see that there is an open issue for this, and a PR (that still needs work) to fix it.


Really, thank you so much for helping me find the reason, you are so cool :100: :+1: !!!

There really are a lot of different JSON field names in my log events; the log format is not well designed.
I have tried the flattened field mapping in Elasticsearch.
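
For reference, this is the kind of mapping I mean; it is only a minimal sketch, and the index name and the doc field here are placeholders rather than my real mapping:

PUT my_log_index
{
  "mappings": {
    "properties": {
      "doc": {
        "type": "flattened"
      }
    }
  }
}

With a flattened field the Elasticsearch mapping no longer grows with every new JSON key, because the whole object is indexed as a single field with keyword leaf values, but it does not change how many distinct field names Logstash itself has to handle.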

Thank you again. I will try another way to solve this.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.