Hi folks,
I've been running a new ELK stack (5.5.1) for about a month, and recently we've hit some problems on the ELK side: we lose some metrics over time in Grafana, and it's also very slow to load fields when I try to customize graphs in Grafana.
Metricbeat (these errors repeat every 5 to 10 minutes):
2017-11-23T15:04:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:07:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:13:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:24:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:37:17+08:00 ERR Failed to publish events caused by: EOF
2017-11-23T15:44:17+08:00 ERR Failed to publish events caused by: EOF
Logstash:
[2017-11-23T14:45:03,342][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://esdata14.tls.ad:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://esdata14.tls.ad:9200/, :error_message=>"Elasticsearch Unreachable: [http://esdata14.tls.ad:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2017-11-23T14:45:03,342][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://esdata14.tls.ad:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2017-11-23T14:45:04,373][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<Java::JavaNet::URI:0x63886b00>}
[2017-11-23T14:45:04,377][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<Java::JavaNet::URI:0x24c681ad>}
[2017-11-23T14:46:05,473][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://esdata20.tls.ad:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://esdata20.tls.ad:9200/, :error_message=>"Elasticsearch Unreachable: [http://esdata20.tls.ad:9200/][Manticore::SocketTimeout] Read timed out",
Elasticsearch:
[2017-11-23T00:56:08,491][INFO ][o.e.m.j.JvmGcMonitorService] [esdata14.tls.ad] [gc][8665566] overhead, spent [252ms] collecting in the last [1s]
[2017-11-23T03:02:15,270][INFO ][o.e.m.j.JvmGcMonitorService] [esdata14.tls.ad] [gc][8673104] overhead, spent [277ms] collecting in the last [1s]
[2017-11-23T03:04:24,304][INFO ][o.e.m.j.JvmGcMonitorService] [esdata14.tls.ad] [gc][8673233] overhead, spent [264ms] collecting in the last [1s]
[2017-11-23T03:23:21,619][INFO ][o.e.m.j.JvmGcMonitorService] [esdata14.tls.ad] [gc][8674364] overhead, spent [266ms] collecting in the last [1s]
PS: our architecture looks like this:
metricbeat (more than 200) -> haproxy (2) -> logstash (2) -> ES (DATA: 20, CLIENT: 2, MASTER: 3)
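For reference, each Metricbeat instance points at the HAProxy pair with an output section roughly like the one below (hostnames, port, and the timeout value are placeholders here, not our exact config):

```yaml
output.logstash:
  # Hypothetical HAProxy frontend addresses -- adjust to the real VIPs.
  hosts: ["haproxy01.tls.ad:5044", "haproxy02.tls.ad:5044"]
  loadbalance: true
  # If I understand the docs correctly, the Beats network timeout defaults
  # to 30s; raising it might matter when Logstash is slow to ACK batches.
  timeout: 60
```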
Server specs:
esdata   -> Mem: 48G, heap: 23G, CPU: 32 cores
esmaster -> Mem: 8G,  heap: 4G,  CPU: 4 cores
esclient -> Mem: 16G, heap: 8G,  CPU: 4 cores
haproxy  -> Mem: 8G,              CPU: 4 cores
logstash -> Mem: 8G,  heap: 4G,  CPU: 2 cores
# Config file for Logstash (logstash.yml)
# ------------------------------------ Pipeline Settings ------------------------------------
# This defaults to the number of the host's CPU cores.
pipeline.workers: 2
# How many workers should be used per output plugin instance
#pipeline.output.workers: 2
# How many events to retrieve from inputs before sending to filters+workers
pipeline.batch.size: 5000
# How long to wait before dispatching an undersized batch to filters+workers
# Value is in milliseconds.
pipeline.batch.delay: 1000
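In case the batch sizing is part of the problem: with pipeline.batch.size at 5000, each worker can send a bulk request of up to 5000 events at a single data node. Would something more conservative like this be safer? (Values below are guesses on my part, not tested.)

```yaml
# More conservative pipeline sketch -- values are guesses, not tested here.
pipeline.workers: 2
# Smaller bulks put less pressure on each data node's bulk queue.
pipeline.batch.size: 500
# Flush undersized batches sooner (milliseconds).
pipeline.batch.delay: 50
```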
Pipeline config for Metricbeat:
#logstash for metricbeat
input {
  beats {
    port => 5044
  }
}

filter {
  mutate { add_field => { "[@metadata][index_prefix]" => "%{agent}" } }
}

output {
  elasticsearch {
    hosts => ["esdata01.tls.ad:9200","esdata02.tls.ad:9200","esdata03.tls.ad:9200","esdata04.tls.ad:9200","esdata05.tls.ad:9200","esdata06.tls.ad:9200","esdata07.tls.ad:9200",
              "esdata08.tls.ad:9200","esdata09.tls.ad:9200","esdata10.tls.ad:9200","esdata11.tls.ad:9200","esdata12.tls.ad:9200","esdata13.tls.ad:9200","esdata14.tls.ad:9200",
              "esdata15.tls.ad:9200","esdata16.tls.ad:9200","esdata17.tls.ad:9200","esdata18.tls.ad:9200","esdata19.tls.ad:9200","esdata20.tls.ad:9200"]
    template_overwrite => false
    manage_template => false
    # weekly indices (ISO week-based year and week number)
    index => "%{[@metadata][index_prefix]}-%{+xxxx.ww}"
    #index => "%{[@metadata][index_prefix]}-%{+YYYY.MM.dd}"
    sniffing => false
    pool_max => 10000
    pool_max_per_route => 1000
  }
}
Any suggestions or ideas on how to keep Logstash from dropping events?