Could filebeat be dropping messages here?

Sarfaraz_Ahmad · December 20, 2017, 1:31pm

Hi,

I have filebeat running on 10 hosts that send logs to a central logstash.

I notice these messages on all of them. The times on different hosts do not coincide, suggesting there is no specific window for these failures.

2017-12-20T05:30:57+05:30 ERR Failed to publish events caused by: read tcp 10.240.172.68:49774->10.219.27.74:5536: i/o timeout
2017-12-20T05:30:57+05:30 INFO Error publishing events (retrying): read tcp 10.240.172.68:49774->10.219.27.74:5536: i/o timeout
2017-12-20T05:31:19+05:30 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.read_bytes=54 libbeat.logstash.published_but_not_acked_events=103 libbeat.logstash.call_count.PublishEvents=6 libbeat.publisher.published_events=941 publish.events=10836 libbeat.logstash.publish.write_bytes=49297 registrar.writes=6 libbeat.logstash.published_and_acked_events=1044 registrar.states.update=10836

and
2017-11-15T05:17:49-05:00 ERR Failed to publish events caused by: write tcp 10.219.26.81:54940->10.219.27.74:5536: write: connection reset by peer
2017-11-15T05:17:49-05:00 INFO Error publishing events (retrying): write tcp 10.219.26.81:54940->10.219.27.74:5536: write: connection reset by peer
2017-11-15T05:17:59-05:00 INFO Non-zero metrics in the last 30s: registrar.states.update=88064 libbeat.logstash.publish.write_bytes=195420 libbeat.publisher.published_events=4602 registrar.writes=43 libbeat.logstash.call_count.PublishEvents=44 libbeat.logstash.publish.read_bytes=270 publish.events=88064 libbeat.logstash.publish.write_errors=1 libbeat.logstash.published_but_not_acked_events=82 libbeat.logstash.published_and_acked_events=4602
2017-11-15T05:18:29-05:00 INFO Non-zero metrics in the last 30s: registar.states.current=-1 libbeat.logstash.published_and_acked_events=4680 registrar.states.update=75776 registrar.writes=37 libbeat.logstash.publish.write_bytes=196788 libbeat.publisher.published_events=4680 libbeat.logstash.publish.read_bytes=228 publish.events=75776 registrar.states.cleanup=1 libbeat.logstash.call_count.PublishEvents=37

Also, my Logstash server is not heavily loaded. I don't see it running out of CPU, memory, or disk.

Here is my filter in logstash,

filter {

if ([type] == "named-externalqueries") {
   grok {
        match => [ "message", "(?<parsedtime>%{MONTHDAY}-%{MONTH}-%{YEAR} %{TIME}) queries: info: client %{IPORHOST:clientIP}#%{NUMBER:clientPort:int}%{SPACE}\(%{DATA:queryName}\): query: %{DATA:queryName2} %{WORD:queryClass} %{WORD:queryType} (?<recursive>[+-])(?<queryFlags>[SETDC]*) \(%{IPORHOST:nameserver}\)", "message", "(?<parsedtime>%{MONTHDAY}-%{MONTH}-%{YEAR} %{TIME}) queries: info: client %{IPORHOST:clientIP}#%{NUMBER:clientPort:int}%{SPACE}\(%{DATA:queryName}\): view %{WORD:queryView}: query: %{DATA:queryName2} %{WORD:queryClass} %{WORD:queryType} (?<recursive>[+-])(?<queryFlags>[SETDC]*) \(%{IPORHOST:nameserver}\)" ]
    }
    ruby {
        code => "
            if !event.get('queryFlags').to_s.empty?
                if event.get('queryFlags').include? 'S'
                    event.tag('queryFlags_signed')
                end
                if event.get('queryFlags').include? 'E'
                    event.tag('queryFlags_edns0')
                end
                if event.get('queryFlags').include? 'T'
                    event.tag('queryFlags_tcp')
                end
                if event.get('queryFlags').include? 'D'
                    event.tag('queryFlags_dnssec')
                end
                if event.get('queryFlags').include? 'C'
                    event.tag('queryFlags_dc')
                end
            end
            if event.get('queryType').include? 'PTR'
                ip=event.get('queryName').match(/.*?((?:[0-9]{1,3}\.){4}).*/)[1].chomp('.').split('.').reverse.join('.')
                event.set('queryIP',ip)
            end
        "
    }
    if [queryType] =~ "PTR" {
        cidr {
            add_tag => [ "_internal_ptr_lookup" ]
            address => [ "%{queryIP}" ]
            network => [ "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16" ]
            }
    }
    if "_internal_ptr_lookup" in [tags] {
        drop {}
    }
    # At times, query name contains internal ip addresses as well (domains are already dropped in filebeat), drop these
    if [queryName] =~ /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/ {
        cidr {
            add_tag => [ "_internal_lookup" ]
            address => [ "%{queryName}" ]
            network => [ "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16" ]
        }
    }
    if "_internal_lookup" in [tags] {
        drop {}
    }

}

I think I am losing some events from Filebeat. Do these messages confirm that?
Is there a document detailing what these mean?
libbeat.logstash.publish.read_errors
libbeat.logstash.published_but_not_acked_events
registar.states.current=-1

Please assist.

Regards,
Ahmad

ruflin · December 28, 2017, 1:51am

Filebeat provides at-least-once delivery by default, so if errors occur while sending to LS, FB will retry. Do you see any errors on the LS side? Something is resetting the connection, which could be a load balancer or a flaky network connection.
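
To make the at-least-once point concrete, here is a minimal Ruby sketch of the idea. It is not Filebeat's actual code: `FlakySink` and `publish_with_retry` are hypothetical names, and the sink simply pretends that the ACK for a batch was lost (similar in spirit to an i/o timeout on read). The point is that a retry after a missing ACK produces duplicates rather than losses.

```ruby
# Hypothetical stand-in for a Logstash endpoint whose ACK sometimes gets lost.
class FlakySink
  attr_reader :received

  def initialize(failures)
    @failures = failures # how many sends lose their ACK
    @received = []
  end

  # Stores the batch, then simulates a lost ACK for the first `failures` calls.
  def send_batch(batch)
    @received.concat(batch)
    if @failures > 0
      @failures -= 1
      raise IOError, 'i/o timeout'
    end
    true
  end
end

# Resends the whole batch until it is acknowledged; returns the attempt count.
def publish_with_retry(sink, batch, max_attempts: 5)
  attempts = 0
  begin
    attempts += 1
    sink.send_batch(batch)
  rescue IOError
    retry if attempts < max_attempts
    raise
  end
  attempts
end

sink = FlakySink.new(1)
attempts = publish_with_retry(sink, %w[a b c])
puts "attempts: #{attempts}, events at sink: #{sink.received.length}"
```

In this sketch the sink ends up holding every event twice, because the first send arrived but was never acknowledged. Under that assumption, errors followed by "(retrying)" point towards possible duplicates, not dropped events.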

In your LS config you drop some of the events; these will definitely be lost. I didn't check the LS config in detail, but could it be that some filters are slow?
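
To check which events the drop rules remove, the logic can be approximated outside Logstash. The following is a rough standalone Ruby sketch, assuming the ruby and cidr filters behave as the config above reads. `ptr_query_ip`, `private_ip?` and `dropped?` are hypothetical helper names, the stdlib `IPAddr` class stands in for the cidr filter, and unlike the original ruby filter the extraction returns nil instead of raising when no IP is found in the query name.

```ruby
require 'ipaddr'

# Private ranges copied from the cidr filters in the config.
PRIVATE_NETWORKS = %w[10.0.0.0/8 172.16.0.0/12 192.168.0.0/16].map { |n| IPAddr.new(n) }

# Same regex as the ruby filter; returns nil when the name contains no IP.
def ptr_query_ip(query_name)
  m = query_name.to_s.match(/.*?((?:[0-9]{1,3}\.){4}).*/)
  return nil unless m
  m[1].chomp('.').split('.').reverse.join('.')
end

# True when the address parses and falls inside one of the private ranges.
def private_ip?(ip)
  addr = IPAddr.new(ip)
  PRIVATE_NETWORKS.any? { |net| net.include?(addr) }
rescue IPAddr::InvalidAddressError
  false
end

# Approximates the two drop rules: internal PTR lookups and bare internal IPs.
def dropped?(query_name, query_type)
  if query_type.to_s.include?('PTR')
    ip = ptr_query_ip(query_name)
    return true if ip && private_ip?(ip)
  end
  if query_name.to_s =~ /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
    return true if private_ip?(query_name)
  end
  false
end

puts dropped?('5.1.168.192.in-addr.arpa', 'PTR')
puts dropped?('8.8.8.8.in-addr.arpa', 'PTR')
puts dropped?('example.com', 'A')
```

Running a sample of the raw log lines through something like this gives a count of events that never reach the output by design, which can then be separated from anything that might be lost in transport.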

Please also provide your Filebeat config and the versions of Logstash and Filebeat you are using.

system · January 10, 2018, 1:31pm

This topic was automatically closed after 21 days. New replies are no longer allowed.
