Filebeat does not re-send logs if they're rejected because of strict mapping

(Vladimir Kazakov) #1

Filebeat: 5.3.2 (64-bit).
OS: Ubuntu 16.04.2 (64-bit).

I have an index template with dynamic set to strict. According to the documentation, this means Elasticsearch will reject any log event that contains a field not specified in the template.
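For reference, a minimal template of the kind I'm describing (the `log` mapping type matches the error below; the field names are illustrative, not my actual template):

```json
{
  "template": "log-*",
  "mappings": {
    "log": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}
```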

This works fine and Filebeat receives an HTTP 400 (client.go:442: WARN Can not index event (status=400): {"type":"strict_dynamic_mapping_exception","reason":"mapping set to strict, dynamic introduction of [beat] within [log] is not allowed"}), but the problem is that Filebeat doesn't try to re-send such logs. So now that I've found that the template needs to change, I want to update it and re-send the logs that weren't sent, but Filebeat does nothing, even when I restart it.

Is there a way to solve this problem? Or is this a Filebeat issue?

(Carlos PĂ©rez Aradros) #2

Hi @vladimir-kazakov,

I would say this is expected behavior, but if you want to reset Filebeat so it sends everything again, you can!

Normally Filebeat stores its current state under the data/ folder; you will find a registry file there, which Filebeat uses to remember which logs have already been sent. By removing that file, you reset the registry state, making Filebeat send everything again.
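A minimal sketch of the reset, assuming a systemd-managed Filebeat and the deb/rpm registry path (for tar.gz installs the registry is in the data/ folder next to the binary instead):

```shell
# Stop Filebeat first, so the registry is not rewritten on shutdown
sudo systemctl stop filebeat

# Remove the registry file (path assumed for deb/rpm installs)
sudo rm /var/lib/filebeat/registry

# On restart, Filebeat treats every file as new and sends everything again
sudo systemctl start filebeat
```

Note that this resends all logs, not just the rejected ones, so expect duplicates in Elasticsearch.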

For more information, take a look at

Best regards

(Vladimir Kazakov) #3

This is not expected behavior. The documentation clearly states: "In situations where the defined output is blocked and has not confirmed all events, Filebeat will keep trying to send events until the output acknowledges that it has received the events."

In other words, if logs are being rejected by Elasticsearch, Filebeat must keep trying to send them until they're accepted. Deleting the registry file is not a solution, because the state of already sent logs will be lost.

(ruflin) #4

I agree that if we get a 400 error, Filebeat should resend the events, and I think it does. @steffens Can you comment on this one?

@vladimir-kazakov Could you share the log output from filebeat? Every 30s an info line is printed on how many events are acked and not acked. Would be interested to see these stats.

(Steffen Siering) #5

Here is the per item error handling:

Depending on the error code, the event is either dropped (so Beats can progress) or retried. Beats assume the condition that caused the event to fail is not resolvable, so in order to make progress, the event is dropped and not indexed. I think Logstash behaves the same way.
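Roughly, the per-item decision looks like this (a Python sketch of the rules, not the actual Go implementation):

```python
def bulk_item_action(status: int) -> str:
    """Sketch of Beats' per-item bulk response handling: successes are
    acked, throttling and server errors are retried, and other client
    errors are dropped so the pipeline can make progress."""
    if 200 <= status < 300:
        return "ack"
    if status == 429 or status >= 500:
        return "retry"  # transient: ES throttling or internal server error
    return "drop"       # e.g. 400 mapping errors: a retry cannot succeed

print(bulk_item_action(201))  # ack
print(bulk_item_action(429))  # retry
print(bulk_item_action(400))  # drop
```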

Unfortunately, we don't have a dead letter queue for non-indexable events in Beats.

(Vladimir Kazakov) #6

@ruflin, here's the output from Filebeat (sensitive data is replaced with [...]).

2017/05/10 05:59:39.497997 crawler.go:58: INFO Loading and starting Prospectors completed. Enabled prospectors: 1
2017/05/10 05:59:39.498018 metrics.go:23: INFO Metrics logging every 30s
2017/05/10 05:59:39.498084 sync.go:41: INFO Start sending events to output
2017/05/10 05:59:39.498159 registrar.go:236: INFO Starting Registrar
2017/05/10 05:59:39.498205 spooler.go:63: INFO Starting spooler: spool_size: 2048; idle_timeout: 5s
2017/05/10 05:59:39.498467 log.go:91: INFO Harvester started for file: /[...]/2017-05-04-v1.json
2017/05/10 05:59:44.499389 client.go:214: DBG  Publish: { [...] }
2017/05/10 05:59:44.499468 output.go:109: DBG  output worker: publish 1 events
2017/05/10 05:59:44.535083 client.go:658: INFO Connected to Elasticsearch version 5.3.2
2017/05/10 05:59:44.535119 output.go:301: INFO Trying to load template for client: http://localhost:9200
2017/05/10 05:59:44.537246 output.go:308: INFO Existing template will be overwritten, as overwrite is enabled.
2017/05/10 05:59:44.602787 client.go:588: INFO Elasticsearch template with name 'log-v1' loaded
2017/05/10 05:59:44.812186 client.go:442: WARN Can not index event (status=400): {"type":"strict_dynamic_mapping_exception","reason":"mapping set to strict, dynamic introduction of [version] within [beat] is not allowed"}
2017/05/10 05:59:44.812225 sync.go:70: DBG  Events sent: 2
2017/05/10 06:00:09.498706 metrics.go:39: INFO Non-zero metrics in the last 30s: filebeat.harvester.open_files=1 filebeat.harvester.running=1 filebeat.harvester.started=1 libbeat.publisher.published_events=1 registrar.states.current=1 registrar.states.update=2 registrar.writes=2
2017/05/10 06:00:39.498692 metrics.go:34: INFO No non-zero metrics in the last 30s

So, when an event is rejected, Filebeat incorrectly concludes that there's no need to re-send it. If there are no new events, Filebeat reports "no non-zero metrics" every 30 seconds, without doing anything else. The piece of code referenced by @steffens proves it: if the status code is 400 (the event is rejected), the event is treated as successfully published, which, of course, is not true.

In my opinion, this is wrong, because rejected events must not be forgotten. They're important. If something changed in the mapping, then it's an indication that Filebeat is configured incorrectly. So, as a user of Filebeat, I must be able to fix the configuration and successfully publish events that were previously rejected. Currently, I cannot do that, and the only option for me is to switch from the strict mapping to a non-dynamic one. In that case, only newly added fields will be ignored. This is not a perfect solution, but it's better than losing the whole event.
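For clarity, the non-dynamic fallback I mean is setting dynamic to false in the template, so unknown fields are silently left out of the mapping instead of the whole event being rejected (field names here are illustrative):

```json
{
  "template": "log-*",
  "mappings": {
    "log": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}
```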

@steffens, is there a good reason why Filebeat doesn't re-send events in the HTTP 400 case? It behaves correctly when there's no connection to Elasticsearch (I'm not sure which HTTP status code is used in that case). Why can't it do the same for HTTP 400? The same applies to HTTP 401, when authentication credentials change. HTTP 4xx errors are client errors, which means that Filebeat is doing something wrong. In that case, it's clear that no events can be sent, so Filebeat must not treat them as already sent. It seems that retrying on client errors would be the more correct behavior.

(Steffen Siering) #7

I think there is a slight misunderstanding of status codes in the bulk API here.

When using the bulk API, there are two kinds of status codes. The first is the HTTP transaction status code, valid for the complete bulk request. This is handled here:

That is, if you get a 4xx at the HTTP request level, all events will be retried until success, because for all responses other than '200 OK', Beats assume Elasticsearch is not available or not set up correctly. This covers your wrong-credentials use case.

Next, in a bulk request you get a status code for each single item being inserted. If we get a 5xx or 429 for an item, Beats assume some internal server/node error that the server needs to recover from. But how do you recover from a 400 per-item status code in a live Elasticsearch instance? There can be other errors, like type mismatches, that we cannot easily recover from. Here a dead-letter queue (e.g. an alternate index?) might help to still store the event somewhere for inspection and reindexing.
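To illustrate, here is a sketch of handling per-item statuses in a bulk response, with a hypothetical dead-letter bucket for the unrecoverable items; the function and the event/response shapes are illustrative, not Beats code:

```python
def partition_bulk_response(events, response):
    """Split the events of one bulk request into (acked, retry, dead)
    based on per-item status codes, following the rules above. The
    'dead' list plays the role of a dead-letter queue: events a retry
    cannot fix (e.g. 400 mapping errors), kept for inspection."""
    acked, retry, dead = [], [], []
    for event, item in zip(events, response["items"]):
        # Each item looks like {"index": {"status": 201, ...}}
        status = next(iter(item.values()))["status"]
        if 200 <= status < 300:
            acked.append(event)
        elif status == 429 or status >= 500:
            retry.append(event)
        else:
            dead.append(event)  # candidate for an alternate index
    return acked, retry, dead

events = [{"msg": "ok"}, {"msg": "throttled"}, {"msg": "bad-mapping"}]
response = {"items": [
    {"index": {"status": 201}},
    {"index": {"status": 429}},
    {"index": {"status": 400}},
]}
acked, retry, dead = partition_bulk_response(events, response)
print(len(acked), len(retry), len(dead))  # 1 1 1
```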

(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.