Slow indexing speed, possibly related to filebeat misconfiguration

I actually have no idea which tag/subforum to use, because I have an issue with a general filebeat -> logstash -> elasticsearch pipeline, and I am not entirely sure whether this issue is related to ES indexing performance, Logstash, Filebeat, the network in general, etc.
I've picked "beats" simply because that's my best guess so far.

Short description:
The new ELK cluster (docker containers on one shared dedicated server) shows rather low indexing performance, around 4-5k events/s. The "old cluster", which consisted of a couple of VMs with lower total resources, has pretty much the same configuration and shows roughly the same performance (I expected the new cluster to perform significantly better).

Nothing I've tried/tested/checked so far has helped to improve the performance.

Details:
New ELK cluster (if I mention just "ELK", "ELK stack" or "server" further in this message, I'm talking about the new one):

  • version: 8.6.2
  • total resources: 24 CPU, 64 GB RAM

Old ELK cluster:

  • version: 7.17.9
  • total resources: 8 CPU, 16 GB RAM (ES + Kibana) + 4 CPU, 8 GB RAM (Logstash)

Filebeats:

  • version: 7.17.*

The configuration was written by me in both cases (the new ELK stack basically copies the configuration of the old stack, plus some modifications because of 1) the transition to docker; 2) the major release update).

From the server's perspective, as far as I can tell, it's not struggling:

  • total CPU utilization lies between 30% and 70% on average
  • there is free RAM available and there is no swap
  • I can't see any iowait time
  • network bandwidth usage is around 5-10 Mbps on average
  • also, I've tried to check some indexing pressure metrics (the query I used is sketched right below this list) - I haven't seen any "*_rejections" fields above zero
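
A quick sketch of that check (the host is a placeholder for wherever the ES HTTP API is reachable):

# node stats, indexing pressure section only
curl -s 'http://localhost:9200/_nodes/stats/indexing_pressure?pretty' | grep -E 'rejections'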

Because of that, I understand it's probable that there are no problems with the ELK stack itself, and that I should check my filebeats for misconfiguration or look for network-related issues.

Speaking of the network: I have 20 Logstash beats inputs open, just for the sake of being sure there are no overloaded buffers/queues etc. Every filebeat instance has all of those 20 ports in its config.
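
For context, the Logstash side of that is just a set of beats inputs on different ports, roughly like this (the port numbers are placeholders, not my real ones):

input {
  beats {
    port => 5044
  }
  beats {
    port => 5045
  }
  # ... and so on, 20 ports in total
}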

Currently filebeat output configuration looks like this:

output:
  logstash:
    workers: 6
    pipelining: 3
    hosts:
      - ...

I have tried raising both workers and pipelining - no difference. I've also tried raising bulk_max_size in both the filebeat and logstash configurations - also no difference.

The only "successfull attempt" I've got so far is trying to use

loadbalance: true

in the filebeat configuration (docs reference) - it did, indeed, improve indexing speed from 4k/s to something like 5k/s.

But:

  1. I dislike the way that option works: all of my filebeat instances start to use all ports simultaneously - that's the opposite of what "load balancing" means to me :slight_smile: Because of that, I think, port/socket/buffer "overloads" are much more probable.
  2. It still doesn't solve the main issue. Even at 5k/s, I didn't see the server struggling, and I would like to get better performance.

Apart from that, I've also tried some other things, like:

  • increasing workers number both for filebeat and logstash
  • increasing/decreasing compression level in filebeat config
    (with no success, obviously)

Questions:

  1. The main one: any ideas what could be wrong in my configuration, i.e. how to locate where the bottleneck is? Any tips/recommendations are appreciated, including any stats/metrics to check.
  2. Since I've already noticed that "loadbalance" in the filebeat configuration kinda helps - is there any way to reach the same result without that exact option?

Are you falling behind in log processing here?

Well, I've found this, this and this to be relevant enough. Still no real success, unfortunately.

I tried to find the bottleneck by measuring throughput at different parts of a "pipeline", and:

  1. Filebeat alone (using "output.console") is able to read my logs locally at a 25k/s rate (a rough sketch of how I measured this is right below the list).
  2. When I use a remote output - no matter whether it's logstash (even with all the filters commented out) or elasticsearch - the rate drops to the same 4k-5k/s. When I mention "filebeat with logstash output" here, I mean the logstash "stdout" output plugin, measuring its performance both with "pv" and with filebeat internal monitoring.
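
For the record, the "filebeat alone" measurement from item 1 was done roughly like this, with the console output enabled and the line rate taken from "pv" (the config file name is a placeholder):

# filebeat-console.yml: same inputs, but only the console output enabled
output.console:
  enabled: true

# run filebeat and measure the event (line) rate of its stdout
filebeat -e -c filebeat-console.yml 2>/dev/null | pv -Warl > /dev/null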

I also tried using a "dummy log" as a data source (a randomly generated file with lines much shorter than in my original files) - filebeat alone started processing it at a 45k/s rate, but, with the remote logstash/elasticsearch output, the rate increased only to 6k/s.

Since the main idea of the links I mentioned relates to bulk/batch sizes and worker counts (both for filebeat and logstash), I've also tried to play with them.
Right now, my logstash has this in its pipelines.yml:

- pipeline.id: "pipeline1"
  path.config: "/usr/share/logstash/pipeline/pipeline1"
  pipeline.ordered: false
  pipeline.workers: 96
  pipeline.batch.size: 1024

And filebeat.yml:

queue.mem:
  events: 600000
  flush.min_events: 512
  flush.timeout: 5s
# ...
output:
  logstash:
    loadbalance: true
    workers: 12
    pipelines: 4
    bulk_max_size: 1024
    hosts:
      - <20 different ports of my logstash here>

By my logic, this should be much more than enough - but no. Looking at the logstash flow stats:

...
    "worker_concurrency" : {
      "current" : 8.992,
      "last_1_minute" : 5.544,
      "last_5_minutes" : 5.699,
      "last_15_minutes" : 7.645,
      "last_1_hour" : 8.096,
      "lifetime" : 8.02
    }
...

- it's not even close to the number of available workers on either side.
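
For reference, the stats above come from the Logstash monitoring API on port 9600 - roughly this query, assuming the default API host/port:

curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
# the per-pipeline "flow" section contains worker_concurrency and the throughput metrics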

What I have also noticed: when I ran some tests to measure rates between filebeat and logstash (using one instance of filebeat with multiple ports of the same logstash), I was able to reach something like a 5k-6k/s rate. But now, when I use multiple filebeats, the total logstash rate is the same - 5k-6k/s for all filebeats combined.
So, logically, that should mean the bottleneck is either the network or logstash itself (or the physical node where it's hosted). But:

  1. Logstash "flow stats" (mentioned above) show, that Logstash workers are far from saturated
  2. I have tested network between one filebeat instance and logstash instance with iperf, and bandwidth easily reached 500 Mbps.
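
The iperf test itself was trivial (hostnames are placeholders):

# on the logstash host
iperf3 -s
# on the filebeat host
iperf3 -c logstash-host -t 30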

So far it seems to me that both filebeat and logstash are doing pretty fine on their own (and elasticsearch is out of the question, because even the "filebeat -> logstash" test (without ES) shows poor performance), but when I try to combine them, something doesn't work out - as if filebeat just doesn't want to send data fast enough, for some unknown reason.

Sorry, I didn't see your message at first.

Yes, I am. I have a couple of logs, which are generated faster than filebeat is able to send them to logstash.

I have managed to find the issue: I had queue.type: persisted enabled for my Logstash. After switching back to queue.type: memory, the issue is gone. Looks like I underestimated its impact on I/O and overestimated my diagnostic skills :slight_smile: (at least with regard to I/O performance)
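
For anyone hitting the same thing, the relevant bit of logstash.yml is just this; the commented persisted-queue lines are a hedged sketch of what I would try if I actually needed the persisted queue, not something I have verified here:

# logstash.yml
queue.type: memory
# If you do need the persisted queue, at least put it on fast storage
# and size it consciously (the values below are illustrative):
# queue.type: persisted
# path.queue: /path/on/fast/ssd/queue
# queue.max_bytes: 4gb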

Anyway, I got many interesting ideas in the slack channel (thread link, if you are lucky enough to check it before it is automatically deleted in 90 days).

Below are some of the most interesting parts - maybe they'll help someone at some point.


Marius Iversen (Elastic Software Engineer)
When you say that you have multiple Logstash ports, are you pointing to the fact that you have a single Logstash with multiple listening ports, or actually multiple Logstash instances running?

alexandrpaliy
I mean a single logstash with multiple listening ports. I did not try multiple logstash instances.

Marius Iversen (Elastic Software Engineer)
More ports will just mean more network stack to keep the state for

Marius Iversen (Elastic Software Engineer)
You don't really need multiple ports to handle multiple workers

alexandrpaliy
About multiple ports: I was worried that I could somehow overload a network port (idk, some internal linux limits for TCP ports/sockets), that's why I decided to use multiple ports.
I can guess that multiple ports will increase Logstash RAM usage a bit, but do you really think they can also cause some [significant] performance decrease?

Marius Iversen (Elastic Software Engineer)
That should not really be a problem, in most cases that would rather be on the interface itself instead

Marius Iversen (Elastic Software Engineer)
I wouldn't say for sure, since we have not really configured it that way in any cases before, but if it's not needed, we should also eliminate the possibilities first :slightly_smiling_face:

Marius Iversen (Elastic Software Engineer)
How many nodes do you have in the Elasticsearch cluster btw?

alexandrpaliy
3. I have 2 disks: an SSD and an HDD. I use one data_hot node for the SSD and one data_cold node for the HDD, and the 3rd (or, rather, "the very first") node is my master node, just to control the other two :slight_smile:

Marius Iversen (Elastic Software Engineer)
Hmm, usually you would want all 3 to be masters and data at the same time, though a bit different since you have one hot and one cold

Marius Iversen (Elastic Software Engineer)
That is not going to impact your ingest though

Marius Iversen (Elastic Software Engineer)
Gotcha, so here is the list in my mind, and I don't expect you to do any/all of them, it's totally up to you, and no need to rush it :slight_smile:

  1. Set up a new metricbeat (8.6+, same as your new cluster version), add K8 monitoring and Elasticsearch monitoring to it; this alone could in many cases point you to the bottleneck. I believe you can simply spin this up in another container.
  2. Set up a second Logstash on the same server; since it's just a container it should be pretty straightforward, I presume?
  3. Use only 1 port on each logstash, drop the worker count significantly (feel free to use the same amount as CPU cores to start).
  4. Configure filebeat with 6 workers, keep loadbalance set to true; also, pipelines is wrong, I think it's called pipelining (https://www.elastic.co/guide/en/beats/filebeat/8.6/logstash-output.html#_pipelining).
  5. Use the filestream input instead of log on filebeat (filestream input | Filebeat Reference [7.17] | Elastic) - a combined sketch of items 4 and 5 is shown right after this list.
  6. If you can test with an 8.x filebeat instead, that would be nice - it has plenty of improvements for filestream (and log state in general).
  7. Filebeat and logstash monitoring is also an option if you want (using metricbeat).
  8. Check the metrics logged by filebeat every 30 seconds, especially around the queue count - does it increase?
  9. If the queue count increases in #8, check the similar logstash stats.
  10. If the logstash queue count is also going up, then it's either the ES output on Logstash or ES itself which is the problem.
  11. You can use Rally to benchmark your new ES cluster; the "track" that you want is called elastic/logs (https://esrally.readthedocs.io/en/latest/race.html).
  12. It might be good to confirm that the cold node does not have the ingest role.
  13. You could reconsider using more, smaller containers, so that you can have 3 hot nodes that also handle ingest and master (but that is just a thought at the moment).
  14. If you are using custom logs only, I presume you also only have a custom data-stream configured? What are the refresh interval and primary/replica counts on it? A slightly higher refresh interval is better for ingest (10 seconds for example, or even 1 minute if that's okay).
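
For reference, items 4 and 5 combined boil down to something like this in filebeat.yml (the input id, paths and hosts are placeholders, not my actual config):

filebeat.inputs:
  - type: filestream
    id: my-app-logs            # hypothetical id
    paths:
      - /var/log/myapp/*.log   # hypothetical path

output.logstash:
  hosts: ["logstash-host:5044"]  # placeholder
  loadbalance: true
  workers: 6
  pipelining: 3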

Marius Iversen (Elastic Software Engineer)
(...) starting with #1 is what I would do at least, you can always delete that data afterwards if you are not interested in it

Marius Iversen (Elastic Software Engineer)
Also, if you are on K8 - I completely forgot - do you use ECK?

alexandrpaliy
Thank you very much, I'll try it. A couple of questions:
0 (general): Initially I thought it was some kind of "plan" of sequential steps. Now that I re-read it, it looks more like just a set of different "measures" to test/compare - is that correct?
1 (and your latest message): No, I don't use k8s at all; I use a pretty simple docker-compose to handle my ELK stack at the moment.
7 (and partially 1): I have tried to use internal monitoring (which is probably considered deprecated as of now :slightly_smiling_face: but which is kinda simpler/faster to set up) for both filebeat and logstash. But I'll try metricbeat, np.
12: No, the cold node definitely doesn't have the ingest role:
$ cat /opt/docker/docker-compose.yml | grep -E '(^ es-\S+:$)|roles'
es-master:
- node.roles=[master, remote_cluster_client]
es-data1:
- node.roles=[data_content,data_hot,data_warm]
es-data2:
- node.roles=[data_cold,data_frozen]
14: No, I don't think I use any custom data-streams (simply because I'm not even sure what that means :stuck_out_tongue: ). I didn't change any "default" elasticsearch settings, except for setting "number_of_replicas": "0" in the ES index template.

Marius Iversen (Elastic Software Engineer)
0: It is just different measures correct

Marius Iversen (Elastic Software Engineer)
7: As long as you use docker-compose, that should be enough for metricbeat; instead of the K8 module you can use the docker module. It would be nice to see some container stats from them as well.
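
For reference, a minimal metricbeat modules sketch for that (hosts are placeholders, and the elasticsearch part assumes the xpack-style collection used for stack monitoring):

metricbeat.modules:
  - module: docker
    metricsets: ["container", "cpu", "diskio", "memory", "network"]
    hosts: ["unix:///var/run/docker.sock"]
    period: 10s
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts: ["http://es-data1:9200"]   # placeholder host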

Marius Iversen (Elastic Software Engineer)
12: Stack monitoring requires ingest pipelines, so at least data1 needs to have the ingest role. There is no "hot" role anymore either, so you can remove that one.
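
In docker-compose terms that would mean roughly this change for my es-data1 service (a sketch based on the compose snippet above; I had not applied it yet at this point):

es-data1:
  environment:
    - node.roles=[data_content,data_hot,data_warm,ingest]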

Marius Iversen (Elastic Software Engineer)
Stats that are interesting, especially if you also collect from Logstash/Filebeat:

  • Filebeat: queue size, ingest rate + ack rate (they should be similar); they are shown in the dashboard.
  • Logstash: queue size, ingest rate + ack rate as well, if available.
  • Elasticsearch: most stats there are going to be useful, and they will show you if it has problems with any of the current container resources.
  • Docker stats: I/O stats for sure is the main one, CPU threads as well if available.

Marius Iversen (Elastic Software Engineer)
There are more as well, but I am writing this from the top of my head, I don't have the stats in front of me :stuck_out_tongue:

Marius Iversen (Elastic Software Engineer)
Also, as a 14th item: check the configured index refresh rate in that index template you modified; it's called refresh_interval: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_unset_or_increase_the_refresh_interval

Marius Iversen (Elastic Software Engineer)
If you have not configured the one above, that alone will give you a big boost.

Marius Iversen (Elastic Software Engineer)
Set it to something like 1m or 5m (the setting is the lag from when the data is ingested to when it's searchable), so use a value you are okay with.
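
For reference, since I already set number_of_replicas in an index template, adding refresh_interval there would look roughly like this (the template name and index pattern are placeholders):

PUT _index_template/my-logs-template
{
  "index_patterns": ["my-logs-*"],
  "template": {
    "settings": {
      "index": {
        "refresh_interval": "1m",
        "number_of_replicas": 0
      }
    }
  }
}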

Marius Iversen (Elastic Software Engineer)
You can always decrease them later again if it works for you :slight_smile:

Dain (Elastic Security PME)
Just to pile on - if I understand correctly, the issue happens when configured as filebeat -> network -> logstash (console output)? If so, I would definitely check the network in detail as well - e.g. a tcpdump inspected with wireshark - look for packet loss, TCP window reductions, retransmits, etc.
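
For reference, a minimal capture for that kind of check could look like this (the interface and port are placeholders):

# on the filebeat host: capture traffic to one of the Logstash ports
tcpdump -i eth0 -w beats.pcap 'tcp port 5044'
# then inspect beats.pcap in Wireshark for retransmissions, zero-window events, dup ACKs, etc.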

alexandrpaliy
Yes, so far "slowness" appears when network is involved. I have tried to test general "bandwidth" between my filebeat and logstash/elasticsearch nodes with iperf3 - it went easily up to 500 Mpbs (and when filebeat sends data to LS/ES, it hardly reaches 5-10 Mbps). There are also no complex firewalls in front/behind those servers, so I have no reasons to blame network itself so far.

Marius Iversen (Elastic Software Engineer)
If the issue is network between Filebeat - Logstash you should see the queue build up on Filebeat but not on Logstash

Marius Iversen (Elastic Software Engineer)
Similar with Logstash - ES

Marius Iversen (Elastic Software Engineer)
You could start with just looking at the stats in that case first

Marius Iversen (Elastic Software Engineer)
@Dain (Elastic Security PME)
What about also having all 3 nodes as masters, two of them for hot + ingest, and the second for warm (no cold/frozen for now)? A bit unsure if that would help.

Marius Iversen (Elastic Software Engineer)
Cold/frozen roles are really only useful if you have object storage (S3, GCS, etc.) or MinIO on-prem.

alexandrpaliy
Hmm, I was definitely checking all the metrics available (when I turned on the legacy "internal" monitoring for filebeat and logstash), but I don't really remember anything about queues there (even though, I understand, that's the very first thing I should check in my situation). Maybe I didn't enable something, or maybe that's exactly why internal monitoring is considered deprecated - I'll check what metricbeat shows me, thank you.

Marius Iversen (Elastic Software Engineer)
You could always start with metricbeat on the server running docker first if you want, as it's the easiest one to configure with docker-compose.

Marius Iversen (Elastic Software Engineer)
At least to get the docker + ES stats
