Logstash loses records on 500 status code

One of the ES nodes in our cluster crashed today, and while the cluster kept functioning, some of the bulk insert requests performed by Logstash failed with the error below:

[2017-07-19T04:47:46,560][WARN ][logstash.outputs.elasticsearch] Failed action. {:status=>500, :action=>["index", {:_id=>"19e9c5dc-f23e-4adc-ad46-176869872768", :_index=>"computing.network_latency.08f694be-0619-485c-a3a4-0d7dd25cf1ef-2017.29", :_type=>"doc", :_routing=>nil}, 2017-07-19T04:45:42.038Z %{host} %{message}], :response=>{"index"=>{"_index"=>"computing.network_latency.08f694be-0619-485c-a3a4-0d7dd25cf1ef-2017.29", "_type"=>"doc", "_id"=>"19e9c5dc-f23e-4adc-ad46-176869872768", "status"=>500, "error"=>{"type"=>"node_not_connected_exception", "reason"=>"[data1-iil-005][10.184.95.5:9300] Node not connected"}}}}

This is obviously a temporary issue, so I would expect Logstash to retry - but it didn't, since 500 isn't a retryable code as far as Logstash is concerned. Either way, this data was lost. We use Logstash 5.4, and even if we were using the DLQ feature (it isn't clear to me which version introduced it), it wouldn't have helped, since the elasticsearch output only sends 400 and 404 errors to the DLQ.
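
For reference, in case it helps anyone on a newer release, here is roughly what the DLQ setup looks like (a sketch; the path and hosts are placeholders) - though, as said, a 500 would never reach it anyway:

# logstash.yml -- the DLQ is off by default
dead_letter_queue.enable: true

# dlq-replay.conf -- a separate pipeline that re-reads DLQ'd events
input {
  dead_letter_queue {
    path => "/var/lib/logstash/dead_letter_queue"   # lives under path.data by default
    commit_offsets => true
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}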

Is this something that should be fixed in Logstash, or in Elasticsearch? While Logstash is arguably correct to treat 500 as a permanent failure in general, this particular exception (a node that is only temporarily unreachable) is something of a gray area.
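
The only related knobs I can find on the elasticsearch output are the retry intervals, and as far as I can tell they only govern codes the plugin already treats as retryable (429 and 503), so tuning them does nothing for this case:

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # Backoff for retryable failures; a bulk item that comes back
    # 500 is dropped outright and never enters this retry loop.
    retry_initial_interval => 2   # seconds, doubled on each attempt
    retry_max_interval => 64
  }
}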
