I am running Logstash and Elasticsearch 5.5.2 in a hot-warm architecture. To free heap space on the Elasticsearch warm nodes, the oldest indices are closed.
The Logstash pipeline reads messages from Kafka, filters them, and outputs to Elasticsearch.
Normally this output goes to the hot nodes (SSD-equipped), which hold the indices for the last 3 days.
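For reference, the pipeline configuration looks roughly like this (a simplified sketch; the topic name, hosts and codec are placeholders, not our exact settings):

    input {
      kafka {
        bootstrap_servers => "kafka01:9092"
        topics            => ["ox_as_fe_ops"]
        group_id          => "OX-AM-OPS-cg-ss"
        codec             => "json"
      }
    }
    output {
      elasticsearch {
        hosts => ["es-hot-01:9200"]
        # daily indices, e.g. ox_as_fe_ops-2017.09.18
        index => "ox_as_fe_ops-%{+YYYY.MM.dd}"
      }
    }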
A problem with the rsyslog instance feeding Kafka caused ten-day-old messages to be fed into Kafka again.
The Logstash elasticsearch output plugin then tried to index them into an index that had already been closed.
What happens is the following:
Logstash correctly logs an INFO message like this one:
[2017-09-26T09:15:36,026][INFO ][logstash.outputs.elasticsearch][am-ops-1] retrying failed action with response code: 403 ({"type"=>"index_closed_exception", "reason"=>"closed", "index_uuid"=>"roxHCI09ToWHtx6gyXLEKA", "index"=>"ox_as_fe_ops-2017.09.18"})
after which we also see:
[2017-09-26T09:15:36,026][INFO ][logstash.outputs.elasticsearch][am-ops-1] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>1}
and after some time:
[2017-09-26T09:16:57,047][INFO ][logstash.outputs.elasticsearch][am-ops-1] Retrying individual bulk actions that failed or were rejected by the previous bulk request. {:count=>5}
After some time it seems that the pipeline stops consuming from Kafka. In fact, the Kafka server.log file contains:
INFO [GroupCoordinator 1]: Group OX-AM-OPS-cg-ss with generation 3 is now empty (kafka.coordinator.GroupCoordinator)
and
INFO [Group Metadata Manager on Broker 1]: Group OX-AM-OPS-cg-ss transitioned to Dead in generation 3 (kafka.coordinator.GroupMetadataManager).
This behavior is easily reproduced on our installation.
We worked around the problem by inserting this piece of code:
    ruby {
      code => "event.cancel if (Time.now.to_f - event.get('@timestamp').to_f) > (60 * 60 * 24 * 5)"
    }
as suggested in: https://stackoverflow.com/questions/30087807/ignore-incoming-logstash-entries-that-are-older-than-a-given-date
Nevertheless, I think this kind of event should not cause the pipeline to stop consuming.
Thanks for your attention,
Marco
It may be possible to use a dead letter queue to capture these events, but I am not sure exactly which return codes are captured by default.
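A rough sketch of what that could look like, assuming the DLQ feature available in Logstash 5.5+ and assuming these failed responses would actually be routed to it (paths and pipeline id below are placeholders):

    # logstash.yml: enable the dead letter queue for the elasticsearch output
    dead_letter_queue.enable: true
    path.dead_letter_queue: "/var/lib/logstash/dead_letter_queue"

    # separate pipeline that reads back the dead-lettered events
    input {
      dead_letter_queue {
        path           => "/var/lib/logstash/dead_letter_queue"
        pipeline_id    => "main"
        commit_offsets => true
      }
    }
    output {
      # e.g. dump them to a file for inspection / later reindexing
      file {
        path => "/tmp/dead_letter_events.log"
      }
    }

Whether the 403 index_closed_exception responses end up in the DLQ rather than being retried would need to be verified against the documentation for your version.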