Pipeline is blocked due to node_not_connected_exception when connecting to data nodes

After noticing an extreme slowdown in my pipeline (from 2500 events/sec to ~600 events/sec), I started Logstash in verbose mode and can see that the events I expect to index are being dropped due to a node_not_connected_exception.

From a Logstash node:

:response=>{"create"=>{"_index"=>"filebeat-2016.12.19", "_type"=>"nginx-access", "_id"=>"AVkdVHPfvcAQ2OYuCYwA", "status"=>500, "error"=>{"type"=>"node_not_connected_exception", "reason"=>"[hyd-mon-storage02][] Node not connected"}}}, :level=>:warn}

However, the index has obviously already been created, and Logstash is no longer even pointed at the data nodes -- only at the two client nodes. Has anyone seen this issue before? None of these events are making it into Elasticsearch anymore, and I can't think of anything that has changed recently to cause this.

OS: Ubuntu 14.04.3
LS: 2.2.4
ES: 2.4.1

It doesn't appear that many folks who have had the same issue got much assistance, so here's what I've done so far to increase throughput (back up to 1800 events/sec from 600; the goal is to get back to ~2500):

  1. Redirected Logstash output away from the data nodes so it points strictly at the client nodes. I wouldn't have thought to point Logstash at the data nodes in the first place, but that setup was suggested elsewhere on the forums.

  2. Disabled scatter-gather offloading on the network cards, as suggested by other users who hit the same issue.
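For reference, step 1 amounts to listing only the client (coordinating) nodes in the elasticsearch output; a minimal sketch, where the hostnames and index pattern are placeholders, not my real config:

```conf
output {
  elasticsearch {
    # Point only at the two client nodes, never the data nodes.
    # Hostnames below are hypothetical placeholders.
    hosts => ["client01:9200", "client02:9200"]
    index => "filebeat-%{+YYYY.MM.dd}"
  }
}
```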
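Step 2 can be done with ethtool; a sketch, assuming the interface is named eth0 (check yours with `ip link`). Note the change does not persist across reboots unless you also add it to the interface configuration:

```shell
# Show the current offload settings for the NIC (eth0 is an assumption)
ethtool -k eth0 | grep scatter-gather

# Disable scatter-gather offloading
sudo ethtool -K eth0 sg off
```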

Between these two changes, the indexing rate appears to be getting back up to the typical rate.

I hope this helps others, please let me know if I missed anything or if you have any additional suggestions.

Are there other things in Monitoring that might highlight the cause, e.g. GC on your nodes?
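In case it helps anyone reading along, GC activity can also be pulled straight from the nodes stats API; a quick check, assuming a node reachable on localhost:9200:

```shell
# Per-node JVM garbage-collection counts and total collection time
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'
```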

To be fair, you only waited 3 hours for a response :slight_smile:

I was referring to searching the forums and seeing mostly single-post threads about this issue. I know you're busy and not slacking; I just wanted to note my findings so the next person who searches for this issue can gain some immediate insight.

@warkolm as you probably figured, the changes I referred to previously did not solve my issue in the long run. GC does occasionally occur on the nodes, and I'm moving to new hardware in the next few weeks, but I'm looking for a way to maintain monitoring until then.

I decided to take this downtime to upgrade everything to 5.1. Will keep you posted once everything is sorted out, thanks for the response and please let me know if I can provide any additional info to help you analyze my issue.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.