Hi,
We are running v6.4.0 of logstash. Our setup has 2 unconnected pipelines. One pipeline 0_main
handles logs from other hosts (shipped via filebeat) and writes to elasticsearch. The other pipeline 2_monitor
queries logstash API and writes health data to a graphite server. Since v6.3.0 a third pipeline, the xpack monitoring pipeline, is automatically a part of our setup.
Yesterday we had a power issue which led to restart of our entire infrastructure (logstash and graphite). The graphite process failed to start due to an unrelated error. Logstash hosts started up all right but because the graphite host could not be reached, the 2_monitor
pipeline went into an infinite loop at startup (looking for the connection to graphite output). My understanding of separate pipelines is that an issue in one pipeline should not affect the other pipeline. However, this morning we noticed that logstash had not fully started.
Here is a snapshot of the log for entire restart sequence
[2018-09-09T01:29:59,280][INFO ][logstash.runner ] Starting Logstash {"logstash.version"=>"6.4.0"}
[2018-09-09T01:30:01,925][WARN ][org.logstash.ackedqueue.Queue] recovered head data page 5081 is different than checkpoint, using recovered page information
[2018-09-09T01:30:02,492][INFO ][logstash.pipeline ] Starting pipeline {:pipeline_id=>".monitoring-logstash", "pipeline.workers"=>1, "pipeline.batch.size"=>2, "pipeline.batch.delay"=>50}
[2018-09-09T01:30:02,837][INFO ][logstash.pipeline ] Starting pipeline {:pipeline_id=>"2_monitor", "pipeline.workers"=>1, "pipeline.batch.size"=>1, "pipeline.batch.delay"=>50}
[2018-09-09T01:30:03,084][WARN ][logstash.outputs.graphite] Connection refused to graphite server, sleeping... {:host=>"statsmaster", :port=>2003}
[2018-09-09T01:30:03,129][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://esdatanodes:9200/]}}
[2018-09-09T01:30:03,208][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://esdatanodes:9200/, :path=>"/"}
[2018-09-09T01:30:03,328][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://esdatanodes:9200/"}
[2018-09-09T01:30:03,366][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>6}
[2018-09-09T01:30:03,366][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>6}
[2018-09-09T01:30:03,368][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//esdatanodes"]}
[2018-09-09T01:30:03,396][INFO ][logstash.licensechecker.licensereader] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://esdatanodes:9200/]}}
[2018-09-09T01:30:03,396][INFO ][logstash.licensechecker.licensereader] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://esdatanodes:9200/, :path=>"/"}
[2018-09-09T01:30:03,403][WARN ][logstash.licensechecker.licensereader] Restored connection to ES instance {:url=>"http://esdatanodes:9200/"}
[2018-09-09T01:30:03,410][INFO ][logstash.licensechecker.licensereader] ES Output version determined {:es_version=>6}
[2018-09-09T01:30:03,473][INFO ][logstash.pipeline ] Pipeline started successfully {:pipeline_id=>".monitoring-logstash", :thread=>"#<Thread:0x6c848a91@/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:156 sleep>"}
[2018-09-09T01:30:05,086][WARN ][logstash.outputs.graphite] Connection refused to graphite server, sleeping... {:host=>"statsmaster", :port=>2003}
[2018-09-09T01:30:05,898][INFO ][logstash.pipeline ] Starting pipeline {:pipeline_id=>"0_main", "pipeline.workers"=>7, "pipeline.batch.size"=>1400, "pipeline.batch.delay"=>50}
[2018-09-09T01:30:05,901][WARN ][logstash.outputs.graphite] Connection refused to graphite server, sleeping... {:host=>"statsmaster", :port=>2003}
.......Repeated for 24 hours......
[2018-09-10T10:11:07,212][WARN ][logstash.outputs.graphite] Connection refused to graphite server, sleeping... {:host=>"statsmaster", :port=>2003}
[2018-09-10T10:11:07,212][WARN ][logstash.outputs.graphite] Connection refused to graphite server, sleeping... {:host=>"statsmaster", :port=>2003}
[2018-09-10T10:11:09,224][INFO ][logstash.inputs.http_poller] Registering http_poller Input {:type=>"metric", :schedule=>{"every"=>"30s"}, :timeout=>nil}
[2018-09-10T10:11:09,229][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://esdatanodes:9200/]}}
[2018-09-10T10:11:09,231][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://esdatanodes:9200/, :path=>"/"}
[2018-09-10T10:11:09,232][INFO ][logstash.pipeline ] Pipeline started successfully {:pipeline_id=>"2_monitor", :thread=>"#<Thread:0x1da04251@/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:156 sleep>"}
[2018-09-10T10:11:09,239][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://esdatanodes:9200/"}
[2018-09-10T10:11:09,247][INFO ][logstash.outputs.elasticsearch] ES Output version determined {:es_version=>6}
[2018-09-10T10:11:09,248][WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>6}
[2018-09-10T10:11:09,249][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//esdatanodes"]}
[2018-09-10T10:11:19,472][INFO ][logstash.pipeline ] Pipeline started successfully {:pipeline_id=>"0_main", :thread=>"#<Thread:0x445681d1@/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:156 sleep>"}
[2018-09-10T10:11:19,485][INFO ][logstash.agent ] Pipelines running {:count=>3, :running_pipelines=>[:"0_main", :"2_monitor", :".monitoring-logstash"], :non_running_pipelines=>[]}
[2018-09-10T10:11:19,486][INFO ][logstash.inputs.metrics ] Monitoring License OK
Logstash attempted to start the .monitoring-logstash
and 2_monitor
pipeline. The .monitoring-logstash
pipeline started up all right and so it tried to start the 0_main
pipeline. But 2_monitor
pipeline got stuck as graphite
output could not be instantiated. From here on, no attempt was made to continue the startup of 0_main
pipeline and logstash was in limbo for 30 hours. This morning, we started the graphite process and once the output was instantiated, the rest of logstash startup continued.
Any thoughts about why this happened?