Throttling processing of log backfill using Filebeat

(Greg T) #1

[Originally posted this on the Elasticsearch forum (Throttling processing of log backfill), but it was suggested that it was more of a Filebeat or Logstash question. Sorry for the cross posting.]

I've done my own research on this, but have not yet found a clear solution. Hoping someone can advise.

We have a very simple single-node ELK server and a client with
Filebeat (Filebeat -> LS -> ES). We don't anticipate a high volume
(even in Production).

Problem is, when initially starting the server, we want to "backfill"
it with a few months worth of logs (say 600+ 1MB daily log files of
various types). Filebeat takes off running, loading as many harvesters
as it can, and floods the ELK server as if there's no tomorrow. LS seems
to keep up OK, but ES gets overwhelmed pretty quickly (returning 429 /
"rate limiting" errors constantly during the backfill operation). Though
it appears LS will keep trying until successful, I've seen evidence
that a lot of messages are getting lots (and never making it into ES).

On the one hand we could attempt to size and configure the server to
support this initial flood, but that doesn't seem appropriate (since
this is a once-off operation; if it takes a few hours to catch up, no
big deal).

How can we safely process a significant "backlog" of files -- once
off -- on a modest server, having the various components "throttle"
traffic to prevent overwhelming ES (which seems to result in errors and missing documents).



(Jordan Sissel) #2

Losing messages is a bug, not a feature. How are you observing this data loss?

(Andrew Cholakian) #3

I'll add that 429 errors are expected as part of normal operation. They aren't really errors, they're more of a signal. The ES output will backoff exponentially. There's a reason they're logged at the info level, and not at the error level.

I'm actually contemplating moving them to the debug error level.

If you are losing data then, as @jordansissel mentioned, that is a bug.

(Andrew Cholakian) #4

Oh, it looks like it's logging as error now. My mistake. I thought we'd changed that, it may be a regression:

(Greg T) #5

Thanks guys.

As for evidence of loss, I can't say for 100% sure, but this is what I'm seeing:

I had previously had my server process all the logs from a particular client. This was when ES was configured with 5 shards per node. Performance was not good, and I realized that 1 shard was probably much more appropriate for my situation. Also, I had been using separate indexes for separate log file types, and decided to combine them all into one index per day (rather than one per log type per day). Prior to making these changes I queried for a list of the indexes (along with document counts). Then I deleted all indexes and the Filebeat registry and ran the whole process again with the new settings. It was during this run that I noticed all the 429 errors. Also, after it completed I ran the index query again and did some comparisons of the results using Excel. On average the new run has about 8% FEWER documents than the previous run (with the lower document counts spread out across all indexes). There may be some other reason for this, but given the 429 errors, I figured the two issues might be related. Definitely open to suggestions. Thanks again.

(Andrew Cholakian) #6

@gtorrance are there other errors in your logstash log aside from the 429 exception?

BTW, we'll release a patch soon to reduce the log severity of 429 errors. You can track that here:

(Greg T) #7

Thanks @Andrew_Cholakian1.

I'm not seeing anything much else in our Logstash logs except an absolute FLOOD of statements like the following. Usually many of the first line, then one of the second, then many of the third.

[2017-04-25T07:51:36,784][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@468fe101 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@52a47e68[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 745231]]"})
[2017-04-25T07:51:36,784][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2017-04-25T07:51:36,784][ERROR][logstash.outputs.elasticsearch] Action

Our Logstash logfile for yesterday (when I did this test) is 163MB (of the above lines). Thanks.

(Steffen Siering) #8

Have you also had a look at filebeat logs? Instead of data-loss, Another reason for document count difference could be duplicates. E.g. filebeat is resending events on connection loss.

Any clue if 429 is reported/handled on batch level or per event?

Receiving 429 from ES indicates we're trying to send too many events much to fast, but instead of having some kind of back-pressure or rate-limiting we're currently forced to retry (wasting CPU and network bandwidth). Maybe this can be mitigated somewhat by limiting the bandwidth on the LS input side (Use OS capabilities to limit bandwidth) and/or reduce the batch sizes and number of workers for sending to ES. Settings flush_size and worker, to not over-saturate the the queues in ES.

Beats has a default batch size of 50 events with 1 worker only. It's not much and can be ramped up a ton on bigger (per performant) setups. But with very small ES setups (in VM on my laptop) I found some events getting 429 at ~60 events per batch.

HTTP transactions can introduce quite some latency, so to increase bandwidth you might want to have some concurrency in the output with a little smaller batch sizes. This implicitly might generate some more backpressure on LS and ES.

Well, just some thoughts.

Anyway, some ideas figuring out if the discrepancy is due to duplicates and/or data-loss would be interesting. Totally agree, data loss from filebeat or LS is a bug. Maybe one can start by collecting total number of documents being expected (wc -l all log files) and logs present. Piping events from filebeat to console (or Logstash via pipe output), we can also count the number of events seen by each. In LS we could also try to hash events to check for duplicate in ES.

(Greg T) #9

Thanks for the reply, @steffens!

Not seeing anything unusual in the Filebeat logs.

I assume a number of your comments are directed at the ELK experts. I’ll leave those alone.

Lots of good ideas for analyzing the count discrepancies. Thanks. I tried some of what you suggested, and re-ran some of the processes. Here’s what I found:

  • I re-ran the full Filebeat -> ELK process with 1-shard indexes. The counts for each index were identical to the previous 1-shard run. Good news!

  • I did the wc -l test on each log file. Counts were quite a bit higher, but I realized this is likely due to multi-line messages (resulting from exceptions in the logs). Probably not an issue.

  • I then ran some tests writing JSON directly to a file from Filebeat. I tested this per month and tested some individual files. (It was fairly manual, though, so I didn’t test every individual file.) Counts matched exactly with the 1-shard tests above. Great news!

  • I then ran the full Filebeat -> ELK process with 5-shard indexes for all indexes in one particular month (to save some time). Oddly, counts for each index were between 59 and 83 messages LARGER than the equivalent 1-shard indexes. This amounts to, on average, about a 0.3% increase in messages vs. the 1-shard indexes. Any theories to explain this?

All of the above runs resulted in indexes with significantly smaller counts than the original 5-shard run (that was referenced in my original post). I can only assume my earlier 5-shard test was flawed in some way or other (since the results are so different from what I’m seeing now). I was still learning the product at that point, making many changes, stopping and starting constantly, etc. The data must have just been messed up somehow. That’s all I can think.

I'm happy to see the consistency in results in recent runs -- even with all the 429 errors.

The question does remain, though, about the 1-shard vs. 5-shard difference (59 – 83 message per index). Thoughts?

Thanks to everyone for the help with this! (Sorry if my likely-bad initial data caused confusion.)


(Steffen Siering) #10

Unfortunately I can't recall how exactly 429 works inside ES. Reason I'm asking for 429 on request vs event leven is: Having multiple nodes, the node receiving the Batch will stream-parse the HTTP body and re-route the events (while parsing rest of the body) to the other nodes for indexing. As A single nodes queue might be full at some given point of view, a random number of events might fail (sometimes in very small bursts). Now comes the part that is unclear to me, but depending on how the 429 error code is generated, there still might be a chance the document being indexed while receiving a 429. It's a very long time since I've seen this, so I really can't recall whats happening here :frowning: But that's another reason I'm asking for duplicates.

A simple way to find duplicates is having an LS processor adding an event id (use keyword type in ES). If events do have timestamps and messages are unique within a time-range of potential consecutive timestamps (no "duplicates" in 1-index case), key can be just the raw log-message itself. Then you can use the term filter in ES (Kibana discover already shows you top-N) to find top N duplicate events. Alternatively write custom script fetching from ES and remove duplicates. Subtracting the duplicates gives you the total number of unique events. If unique events still match the original event count (minus exact duplicates in original logs), FB->LS->ES do work well in limits of send-at-least-once semantics.

(Greg T) #11

Thanks for the reply, @steffens. As I've been doing further testing, though, I can't seem to duplicate the problem. It's very strange (as I know I was getting unusual results earlier). Without being able to duplicate the problem, I don't think there's much we can do at this point.

Thanks for taking the time to help out with this, though. I do appreciate it!

(system) #12

This topic was automatically closed after 21 days. New replies are no longer allowed.