Logstash can't talk to Elasticsearch after running for several hours

I've found that after running for several hours, my Logstash can no longer connect to Elasticsearch. The Logstash log says:

{:timestamp=>"2016-10-17T04:39:40.312000+0200", :message=>"Starting pipeline", :id=>"main", :pipeline_workers=>8, :batch_size=>1000, :batch_delay=>5, :max_inflight=>8000, :level=>:info}
{:timestamp=>"2016-10-17T04:39:40.325000+0200", :message=>"Pipeline main started"}
{:timestamp=>"2016-10-17T06:18:55.520000+0200", :message=>"Attempted to send a bulk request to Elasticsearch configured at '["http://146.89.179.204:9200"]', but Elasticsearch appears to be unreachable or down!", :error_message=>"Connection timed out", :class=>"Manticore::SocketException", :level=>:error}

However, I checked and Elasticsearch itself was still running.
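To rule out the process being up but unreachable from this machine, I probe the node from the Logstash host itself. This is just a rough sketch (not part of my pipeline) with explicit timeouts, so it fails fast instead of hanging the way the stuck bulk requests do; the host/port are the ones from the log above:

```ruby
require 'net/http'

# Rough reachability probe, run from the Logstash host.
# Explicit open/read timeouts make it fail fast rather than block,
# unlike the bulk requests that hang with no socket timeout.
def es_reachable?(host, port, timeout = 5)
  Net::HTTP.start(host, port, open_timeout: timeout, read_timeout: timeout) do |http|
    http.get('/_cluster/health').is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  false
end

# es_reachable?('146.89.179.204', 9200)  # true only while the node answers from here
```

If this returns false from VM A while a local check on the Elasticsearch host succeeds, the problem is on the network path rather than in Elasticsearch.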

What I'm using:

  • Filebeat: 1.2.3
  • Logstash: 2.3.4
  • Elasticsearch: 2.3.5

Any help is greatly appreciated. Thanks a lot!

Has anyone run into this issue before, or does anyone know what might cause it?

I found some more logs on the Logstash side. They show 3 running workers (out of 8 in total), and each of the running workers was stuck in the last reported plugin, the date filter.

{:timestamp=>"2016-10-18T09:42:15.320000+0200", :message=>#<LogStash::PipelineReporter::Snapshot:0x7241d839 @data={:events_filtered=>58098, :events_consumed=>58098, :worker_count=>8, :inflight_count=>179, :worker_states=>[{:status=>"dead", :alive=>false, :index=>0, :inflight_count=>0}, {:status=>"run", :alive=>true, :index=>1, :inflight_count=>36}, {:status=>"dead", :alive=>false, :index=>2, :inflight_count=>0}, {:status=>"dead", :alive=>false, :index=>3, :inflight_count=>0}, {:status=>"dead", :alive=>false, :index=>4, :inflight_count=>0}, {:status=>"run", :alive=>true, :index=>5, :inflight_count=>88}, {:status=>"dead", :alive=>false, :index=>6, :inflight_count=>0}, {:status=>"run", :alive=>true, :index=>7, :inflight_count=>55}], :output_info=>[{:type=>"elasticsearch", :config=>{"hosts"=>"146.89.179.204", "index"=>"logstash-site-%{+YYYY.MM.dd}", "workers"=>8, "flush_size"=>1000, "ALLOW_ENV"=>false}, :is_multi_worker=>true, :events_received=>58098, :workers=><Java::JavaUtilConcurrent::CopyOnWriteArrayList:-1980582399 [<LogStash::Outputs::ElasticSearch hosts=>["146.89.179.204"], index=>"logstash-site-%{+YYYY.MM.dd}", workers=>8, flush_size=>1000, codec=><LogStash::Codecs::Plain charset=>"UTF-8">, manage_template=>true, template_name=>"logstash", template_overwrite=>false, idle_flush_time=>1, doc_as_upsert=>false, max_retries=>3, script_type=>"inline", script_var_name=>"event", scripted_upsert=>false, retry_max_interval=>2, retry_max_items=>500, retry_on_conflict=>1, action=>"index", ssl_certificate_verification=>true, sniffing=>false, sniffing_delay=>5>, <LogStash::Outputs::ElasticSearch hosts=>["146.89.179.204"], index=>"logstash-site-%{+YYYY.MM.dd}", workers=>8, flush_size=>1000, codec=><LogStash::Codecs::Plain charset=>"UTF-8">, manage_template=>true, template_name=>"logstash", template_overwrite=>false, idle_flush_time=>1, doc_as_upsert=>false, max_retries=>3, script_type=>"inline", script_var_name=>"event", scripted_upsert=>false, retry_max_interval=>2, 
retry_max_items=>500, retry_on_conflict=>1, action=>"index", ssl_certificate_verification=>true, sniffing=>false, sniffing_delay=>5>, <LogStash::Outputs::ElasticSearch hosts=>["146.89.179.204"], index=>"logstash-site-%{+YYYY.MM.dd}", workers=>8, flush_size=>1000, codec=><LogStash::Codecs::Plain charset=>"UTF-8">, manage_template=>true, template_name=>"logstash", template_overwrite=>false, idle_flush_time=>1, doc_as_upsert=>false, max_retries=>3, script_type=>"inline", script_var_name=>"event", scripted_upsert=>false, retry_max_interval=>2, retry_max_items=>500, retry_on_conflict=>1, action=>"index", ssl_certificate_verification=>true, sniffing=>false, sniffing_delay=>5>, <LogStash::Outputs::ElasticSearch hosts=>["146.89.179.204"], index=>"logstash-site-%{+YYYY.MM.dd}", workers=>8, flush_size=>1000, codec=><LogStash::Codecs::Plain charset=>"UTF-8">, manage_template=>true, template_name=>"logstash", template_overwrite=>false, idle_flush_time=>1, doc_as_upsert=>false, max_retries=>3, script_type=>"inline", script_var_name=>"event", scripted_upsert=>false, retry_max_interval=>2, retry_max_items=>500, retry_on_conflict=>1, action=>"index", ssl_certificate_verification=>true, sniffing=>false, sniffing_delay=>5>,

This log line is too long; the rest is in the follow-up reply below.

Continued log:

:busy_workers=>3}], :thread_info=>[{"thread_id"=>28, "name"=>"[main]>worker1", "plugin"=>["LogStash::Filters::Date", {"match"=>["Logtime", "M/d/yy HH:mm:ss:SSS z"], "timezone"=>"UTC"}], "backtrace"=>["[...]/vendor/bundle/jruby/1.9/gems/manticore-0.6.0-java/lib/manticore/response.rb:50:in `call'", "[...]/vendor/bundle/jruby/1.9/gems/manticore-0.6.0-java/lib/manticore/response.rb:256:in `call_once'", "[...]/vendor/bundle/jruby/1.9/gems/manticore-0.6.0-java/lib/manticore/response.rb:153:in `code'", "[...]/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/http/manticore.rb:84:in `perform_request'", "[...]/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:257:in `call'", "[...]/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:257:in `perform_request'", "[...]/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/http/manticore.rb:67:in `perform_request'", "[...]/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/client.rb:128:in `perform_request'", "[...]/vendor/bundle/jruby/1.9/gems/elasticsearch-api-1.0.18/lib/elasticsearch/api/actions/bulk.rb:90:in `bulk'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/http_client.rb:53:in `non_threadsafe_bulk'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/http_client.rb:38:in `bulk'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/http_client.rb:38:in `synchronize'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/http_client.rb:38:in `bulk'",
"[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/common.rb:172:in `safe_bulk'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/common.rb:101:in `submit'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/common.rb:86:in `retrying_submit'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/common.rb:29:in `multi_receive'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/common.rb:28:in `each_slice'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.7.1-java/lib/logstash/outputs/elasticsearch/common.rb:28:in `multi_receive'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.4-java/lib/logstash/output_delegator.rb:130:in `worker_multi_receive'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.4-java/lib/logstash/output_delegator.rb:129:in `worker_multi_receive'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.4-java/lib/logstash/output_delegator.rb:114:in `multi_receive'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.4-java/lib/logstash/pipeline.rb:301:in `output_batch'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.4-java/lib/logstash/pipeline.rb:301:in `each'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.4-java/lib/logstash/pipeline.rb:301:in `output_batch'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.4-java/lib/logstash/pipeline.rb:232:in `worker_loop'", "[...]/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.4-java/lib/logstash/pipeline.rb:201:in `start_workers'"], "blocked_on"=>nil, "status"=>"run", "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/manticore-0.6.0-java/lib/manticore/response.rb:50:in `call'"}], :stalling_threads_info=>[{"thread_id"=>28, "name"=>"[main]>worker1",
"plugin"=>["LogStash::Filters::Date", {"match"=>["Logtime", "M/d/yy HH:mm:ss:SSS z"], "timezone"=>"UTC"}], "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/manticore-0.6.0-java/lib/manticore/response.rb:50:in `call'"}, {"thread_id"=>32, "name"=>"[main]>worker5", "plugin"=>["LogStash::Filters::Date", {"match"=>["Logtime", "M/d/yy HH:mm:ss:SSS z"], "timezone"=>"UTC"}], "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/manticore-0.6.0-java/lib/manticore/response.rb:50:in `call'"}, {"thread_id"=>34, "name"=>"[main]>worker7", "plugin"=>["LogStash::Filters::Date", {"match"=>["Logtime", "M/d/yy HH:mm:ss:SSS z"], "timezone"=>"UTC"}], "current_call"=>"[...]/vendor/bundle/jruby/1.9/gems/manticore-0.6.0-java/lib/manticore/response.rb:50:in `call'"}]}>, :level=>:warn}

Is there anything in the Elasticsearch logs around the time the problems occurred?

No, I didn't see anything in the Elasticsearch logs. Every time I run into this issue, I can only resolve it by killing Logstash (kill -9) and starting it again; a graceful shutdown doesn't work.

I have two VMs, A and B. A runs only Logstash, while B runs both Logstash and Elasticsearch. The connection problem only occurs on A.

I'm not sure whether this information helps to identify the error. Lately the issue has been occurring very often, which has greatly impacted our daily use.

Any help is greatly appreciated. Thanks so much!

Hi.
Our Logstash agents also had this issue. After a while they hit the socket
exception and could never recover; restarting them was the only way.

The previous setup:

  • Logstash: 2.1 and 2.3
  • Elasticsearch: 1.7

I ended up editing common.rb in the elasticsearch output to log how many
events each bulk request tries to send and how big it is. Often the bulk
request was small: 1 event and around 500 bytes, so probably not an
oversized bulk request then :slight_smile:

I ended up upgrading all agents to 2.4, and the error disappeared except
on one node that sends around 200k events/hour. The good news is that it
can now recover by itself.

Hope it helps

//Rickard

Hi Rickard,

Thanks so much for the information. I'll try the upgrade later to see whether it resolves my issue, and I'll post any progress here. Thanks again.