One of the ES nodes in our cluster crashed today, and while the cluster kept functioning, some of the bulk insert requests performed by Logstash failed with the error below:
[2017-07-19T04:47:46,560][WARN ][logstash.outputs.elasticsearch] Failed action. {:status=>500, :action=>["index", {:_id=>"19e9c5dc-f23e-4adc-ad46-176869872768", :_index=>"computing.network_latency.08f694be-0619-485c-a3a4-0d7dd25cf1ef-2017.29", :_type=>"doc", :_routing=>nil}, 2017-07-19T04:45:42.038Z %{host} %{message}], :response=>{"index"=>{"_index"=>"computing.network_latency.08f694be-0619-485c-a3a4-0d7dd25cf1ef-2017.29", "_type"=>"doc", "_id"=>"19e9c5dc-f23e-4adc-ad46-176869872768", "status"=>500, "error"=>{"type"=>"node_not_connected_exception", "reason"=>"[data1-iil-005][10.184.95.5:9300] Node not connected"}}}}
This is obviously a transient issue, so I would expect Logstash to retry, but it didn't: 500 isn't a retryable code as far as Logstash is concerned. Either way, the data was lost. We use Logstash 5.4, but even if we were using the DLQ feature (I'm not sure in which version it becomes available), it wouldn't have helped, since the elasticsearch output only sends 400 and 404 errors to the DLQ.
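To make the failure mode concrete, here is a minimal sketch of the status-based decision described above. The constant and method names are my own illustration, not Logstash's actual source; the assumption is only that retryable statuses (e.g. 429/503) are retried, 400/404 can be dead-lettered, and everything else is dropped:

```ruby
# Illustrative sketch (not actual Logstash code) of how a bulk-response
# status might be classified by the elasticsearch output.
RETRYABLE_CODES = [429, 503]  # assumed retryable statuses
DLQ_CODES       = [400, 404]  # statuses eligible for the dead letter queue

def handle_bulk_response(status)
  if RETRYABLE_CODES.include?(status)
    :retry        # transient back-pressure: try the event again
  elsif DLQ_CODES.include?(status)
    :dead_letter  # bad event: park it in the DLQ for inspection
  else
    :drop         # a 500 such as node_not_connected_exception lands here
  end
end
```

Under this classification, the `node_not_connected_exception` above falls through to the drop branch even though the condition is transient, which is exactly the data loss we observed.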
Is this something that should be fixed in Logstash, or in Elasticsearch? While Logstash is arguably correct to treat a 500 as a permanent failure, this particular exception is something of a gray area.