Thank you for reading my problem, and sorry to bother you!
1. Env
- Logstash version: 7.5.0
- Java version:
java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
- logstash.yml
pipeline.workers: 6
pipeline.batch.size: 50
- jvm.options
-Xms6g
-Xmx6g
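The rest of jvm.options is unchanged from the Logstash 7.5 defaults. Judging from the gc.log output below, the GC flags in effect should be roughly the stock CMS configuration (reconstructed from memory of the default file rather than copied, so treat the exact flags as approximate):
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
# plus the default GC logging flags, e.g. -XX:+PrintGCDetails,
# -XX:+PrintGCDateStamps, -XX:+PrintGCApplicationStoppedTime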
2. Problem
I use Logstash to consume logs from Kafka, run them through a grok filter, and then write them to an Elasticsearch cluster. But I find that my Logstash process goes into very frequent GC; the GC consumes so much time that Logstash does almost no work.
The gc.log is full of entries like this:
2021-11-27T17:36:21.048+0800: 78068.514: [CMS-concurrent-mark-start]
2021-11-27T17:36:23.878+0800: 78071.344: [Full GC (Allocation Failure) 2021-11-27T17:36:23.878+0800: 78071.344: [CMS2021-11-27T17:36:25.512+0800: 78072.978: [CMS-concurrent-mark: 4.462/4.464 secs] [Times: user=39.30 sys=0.82, real=4.47 secs]
(concurrent mode failure): 4194303K->4194303K(4194304K), 35.7906436 secs] 6081790K->4204353K(6081792K), [Metaspace: 103267K->103267K(1155072K)], 35.7908146 secs] [Times: user=44.59 sys=0.67, real=35.79 secs]
2021-11-27T17:36:59.669+0800: 78107.135: Total time for which application threads were stopped: 35.7917723 seconds, Stopping threads took: 0.0000556 seconds
2021-11-27T17:36:59.673+0800: 78107.139: Total time for which application threads were stopped: 0.0027825 seconds, Stopping threads took: 0.0001835 seconds
2021-11-27T17:37:00.674+0800: 78108.140: Total time for which application threads were stopped: 0.0007674 seconds, Stopping threads took: 0.0000516 seconds
2021-11-27T17:37:01.669+0800: 78109.135: [GC (CMS Initial Mark) [1 CMS-initial-mark: 4194303K(4194304K)] 4310321K(6081792K), 0.0121234 secs] [Times: user=0.10 sys=0.00, real=0.02 secs]
2021-11-27T17:37:01.681+0800: 78109.147: Total time for which application threads were stopped: 0.0129999 seconds, Stopping threads took: 0.0000897 seconds
2021-11-27T17:37:01.681+0800: 78109.147: [CMS-concurrent-mark-start]
2021-11-27T17:37:06.359+0800: 78113.825: [CMS-concurrent-mark: 4.678/4.678 secs] [Times: user=43.90 sys=1.14, real=4.67 secs]
2021-11-27T17:37:06.359+0800: 78113.825: [CMS-concurrent-preclean-start]
2021-11-27T17:37:06.569+0800: 78114.035: [Full GC (Allocation Failure) 2021-11-27T17:37:06.569+0800: 78114.035: [CMS2021-11-27T17:37:12.639+0800: 78120.106: [CMS-concurrent-preclean: 6.278/6.280 secs] [Times: user=7.09 sys=0.17, real=6.28 secs]
(concurrent mode failure): 4194303K->4194303K(4194304K), 40.9531845 secs] 6081787K->4205126K(6081792K), [Metaspace: 103268K->103268K(1155072K)], 40.9533777 secs] [Times: user=39.95 sys=0.84, real=40.96 secs]
2021-11-27T17:37:47.523+0800: 78154.989: Total time for which application threads were stopped: 40.9546434 seconds, Stopping threads took: 0.0000614 seconds
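To put a number on "most of the time", the stop-the-world pauses can be summed from gc.log with a one-liner like this (the $11 field position assumes exactly the log format shown above):
grep 'Total time for which application threads were stopped' gc.log \
  | awk '{ sum += $11 } END { printf "stopped %.1f s across %d pauses\n", sum, NR }'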
I also ran jmap -histo:live 32409 (32409 is the Logstash PID) to inspect the heap. The top entries look like this:
num #instances #bytes class name
----------------------------------------------
1: 17289297 691571880 org.jruby.util.ByteList
2: 17245027 689801080 org.jruby.RubyString
3: 17417048 659620560 [C
4: 17334568 554706176 java.util.concurrent.ConcurrentHashMap$Node
5: 17214558 550865856 org.logstash.FieldReference
6: 17296590 514218192 [B
7: 17416092 417986208 java.lang.String
8: 8271 135421720 [Ljava.util.concurrent.ConcurrentHashMap$Node;
9: 139671 4469472 java.util.HashMap$Node
10: 74368 4338232 [Ljava.lang.Object;
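To confirm that these instance counts keep growing rather than being a one-off snapshot, two histograms can be taken a few minutes apart and compared (note that jmap -histo:live itself forces a full GC):
jmap -histo:live 32409 | head -n 14 > histo.1
sleep 300   # wait a few minutes under normal load
jmap -histo:live 32409 | head -n 14 > histo.2
diff histo.1 histo.2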
I want to know how I can avoid this problem.
3. Logstash settings
Pipeline config file:
input {
  kafka {
    bootstrap_servers => "xxx:9092"
    topics_pattern => "log_.*"
    auto_offset_reset => "latest"
    group_id => "log-es-find"
    consumer_threads => 1
    client_id => "node123"
    partition_assignment_strategy => "org.apache.kafka.clients.consumer.RoundRobinAssignor"
    max_poll_interval_ms => "600000"
    max_poll_records => "100"
    codec => json {
      charset => "UTF-8"
    }
  }
}
filter {
  mutate {
    remove_field => [ "host", "@timestamp" ]
  }
  if "_LOGV3_" not in [message] {
    grok {
      match => {
        "message" => '^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{STRING:process} %{STRING:thread} - %{SESSION_ID:session_id} %{GREEDYDATA:message_log}$'
      }
      remove_field => ["message"]
    }
  } else {
    grok {
      match => {
        "message" => '^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{STRING:process} %{STRING:thread} _LOGV3_ %{SESSION_ID:session_id} %{EVENT_MARK:event} %{GREEDYDATA:message_log}$'
      }
    }
    json {
      source => "message_log"
      target => "doc.custom"
      remove_field => ["message", "message_log"]
    }
    mutate {
      rename => { "doc.custom" => "doc.%{event}" }
    }
  }
  if "_groktimeout" in [tags] {
    grok {
      match => { "message" => "^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type}" }
      remove_tag => [ "_groktimeout" ]
      add_tag => [ "timeout" ]
    }
  }
  if "_grokparsefailure" in [tags] {
    grok {
      remove_tag => [ "_grokparsefailure" ]
      add_tag => [ "grok_failure" ]
      match => { "message" => '^%{LOGLEVEL:log_level} %{TIMESTAMP_ISO8601:time} %{HOST:host_name} %{SERVICCE:service_type} %{GREEDYDATA:message_log}$' }
    }
  }
  date {
    match => ["time", "ISO8601"]
    target => "@timestamp"
    locale => "cn"
  }
  mutate {
    remove_field => ["time", "@metadata", "host", "input", "agent", "ecs", "log"]
  }
  if "_grokparsefailure" in [tags] {
    mutate {
      add_field => { "service_type" => "grok_failure" }
    }
  }
  if "_groktimeout" in [tags] {
    mutate {
      add_field => { "service_type" => "grok_time_out" }
    }
  }
}
output {
  elasticsearch {
    hosts => ["http://es:9200"]
    index => "mai_log_%{[service_type]}_detail_%{+YYYY.MM.dd}"
    user => "elastic"
    password => "xxx"
  }
}
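Note: SERVICCE, STRING, SESSION_ID and EVENT_MARK in the grok filters are custom patterns defined elsewhere in our setup (LOGLEVEL, TIMESTAMP_ISO8601, HOST and GREEDYDATA are the standard grok ones). The sketch below only shows the shape of such a patterns file; the regexes are placeholders, not the real definitions:
# custom patterns file -- placeholder regexes for illustration only
SERVICCE \S+
STRING \S+
SESSION_ID \S+
EVENT_MARK \S+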
I am sure the bottleneck is not Elasticsearch, because I have several Logstash instances and the others are working fine at the moment (though they may run into the same problem at other times).
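For comparing heap pressure across instances, the Logstash node stats API (default port 9600) exposes per-node JVM heap usage and GC counters:
curl -s 'http://localhost:9600/_node/stats/jvm?pretty'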
Thank you for reading my problem, and sorry to bother you!