Logstash rejecting metrics

Hi folks,

I've been running a new ELK stack (5.5.1) for about a month, and recently we've run into some problems on the ELK side: we lose some metrics over time in Grafana, which is pretty bad, and loading fields is very slow when I try to customize panels in Grafana.

Metricbeat (these errors appear every 5 to 10 minutes):

2017-11-23T15:04:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:07:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:13:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:24:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:37:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:44:17+08:00 ERR Failed to publish events caused by: EOF

Logstash:

[2017-11-23T14:45:03,342][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://esdata14.tls.ad:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://esdata14.tls.ad:9200/, :error_message=>"Elasticsearch Unreachable: [http://esdata14.tls.ad:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2017-11-23T14:45:03,342][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://esdata14.tls.ad:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2017-11-23T14:45:04,373][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<Java::JavaNet::URI:0x63886b00>}
[2017-11-23T14:45:04,377][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<Java::JavaNet::URI:0x24c681ad>}
[2017-11-23T14:46:05,473][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://esdata20.tls.ad:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://esdata20.tls.ad:9200/, :error_message=>"Elasticsearch Unreachable: [http://esdata20.tls.ad:9200/][Manticore::SocketTimeout] Read timed out", 

Elasticsearch:

[2017-11-23T00:56:08,491][INFO ][o.e.m.j.JvmGcMonitorService] [esdata14.tls.ad] [gc][8665566] overhead, spent [252ms] collecting in the last [1s]
[2017-11-23T03:02:15,270][INFO ][o.e.m.j.JvmGcMonitorService] [esdata14.tls.ad] [gc][8673104] overhead, spent [277ms] collecting in the last [1s]
[2017-11-23T03:04:24,304][INFO ][o.e.m.j.JvmGcMonitorService] [esdata14.tls.ad] [gc][8673233] overhead, spent [264ms] collecting in the last [1s]
[2017-11-23T03:23:21,619][INFO ][o.e.m.j.JvmGcMonitorService] [esdata14.tls.ad] [gc][8674364] overhead, spent [266ms] collecting in the last [1s]

PS: our architecture looks like this:

metricbeat(more than 200) -> haproxy(2) -> logstash(2) -> ES(DATA: 20, CLIENT: 2, MASTER: 3)
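
Each Metricbeat ships to the HAProxy layer, which forwards to the two Logstash nodes. Roughly, the output section on the agents looks like this (the load-balancer hostname and port below are illustrative, not our real ones):

#metricbeat.yml output section (sketch)
output.logstash:
  # VIP in front of the two HAProxy nodes (name/port made up for this post)
  hosts: ["beats-lb.tls.ad:5044"]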

Server specs:
esdata -> Mem: 48G, heap: 23G, CPU: 32 cores
esmaster -> Mem: 8G, heap: 4G, CPU: 4 cores
esclient -> Mem: 16G, heap: 8G, CPU: 4 cores
haproxy -> Mem: 8G, CPU: 4 cores
logstash -> Mem: 8G, heap: 4G, CPU: 2 cores

# Logstash settings file (logstash.yml)

# ------------------------------------ Pipeline Settings ------------------------------------

# This defaults to the number of the host's CPU cores.
pipeline.workers: 2

# How many workers should be used per output plugin instance
#pipeline.output.workers: 2

# How many events to retrieve from inputs before sending to filters+workers
pipeline.batch.size: 5000

# How long to wait before dispatching an undersized batch to filters+workers
# Value is in milliseconds.
pipeline.batch.delay: 1000
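
If I understand these settings right (rough back-of-envelope, happy to be corrected):

# max in-flight events per Logstash instance
#   = pipeline.workers * pipeline.batch.size = 2 * 5000 = 10000
# and each batch of up to 5000 events should go out as one bulk request
# to the elasticsearch output, so bigger batches mean bigger bulks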

Pipeline config for Metricbeat:

#logstash for metricbeat
input {
  beats {
    port => 5044
  }
}

filter {
  mutate { add_field => { "[@metadata][index_prefix]" => "%{agent}" } }
}


output {
  elasticsearch {
     hosts => ["esdata01.tls.ad:9200","esdata02.tls.ad:9200","esdata03.tls.ad:9200","esdata04.tls.ad:9200","esdata05.tls.ad:9200","esdata06.tls.ad:9200","esdata07.tls.ad:9200",
               "esdata08.tls.ad:9200","esdata09.tls.ad:9200","esdata10.tls.ad:9200","esdata11.tls.ad:9200","esdata12.tls.ad:9200","esdata13.tls.ad:9200","esdata14.tls.ad:9200",
               "esdata15.tls.ad:9200","esdata16.tls.ad:9200","esdata17.tls.ad:9200","esdata18.tls.ad:9200","esdata19.tls.ad:9200","esdata20.tls.ad:9200"]
     template_overwrite => false
     manage_template => false
     index => "%{[@metadata][index_prefix]}-%{+xxxx.ww}"
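     # %{+xxxx.ww} is a Joda-Time pattern (week-year.week-of-year), so this
     # writes weekly indices, e.g. "<prefix>-2017.47" for events from 2017-11-23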
     #index => "%{[@metadata][index_prefix]}-%{+YYYY.MM.dd}"
     sniffing => false
     pool_max => 10000
     pool_max_per_route => 1000
   }
}

Any suggestions or ideas on how to keep Logstash from dropping events?

When I pointed Metricbeat at Logstash directly instead of going through HAProxy, I saw error messages like this:

2017-11-24T14:17:41+08:00 ERR Failed to publish events caused by: write tcp 10.0.0.1:55268->10.0.0.2:5044: write: connection reset by peer
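
For that test the only change on the Metricbeat side was the output hosts, pointing at the two Logstash nodes instead of the load balancer, roughly like this (hostnames are illustrative):

#metricbeat.yml output section for the direct-to-Logstash test (sketch)
output.logstash:
  # the two Logstash nodes, bypassing HAProxy (names made up for this post)
  hosts: ["logstash01.tls.ad:5044", "logstash02.tls.ad:5044"]
  loadbalance: true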
