I have managed to find the issue: I had queue.type: persisted enabled for my Logstash. After switching back to queue.type: memory, the problem is gone. Looks like I underestimated its impact on I/O and overestimated my diagnostic skills (at least as far as I/O performance is concerned).
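For reference, the setting in question lives in logstash.yml; here is a minimal sketch (memory is also the default, so simply removing the persisted line has the same effect):

```yaml
# logstash.yml (sketch) -- the queue settings discussed above
queue.type: memory        # was "persisted", which adds disk I/O for every event
# queue.type: persisted   # if you do need durability, keep an eye on disk throughput and
# queue.max_bytes: 1gb    # the on-disk queue capacity (1gb is the default) before backpressure
```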
Anyway, I got many interesting ideas in the Slack channel (thread link, if you are lucky enough to check it before it is automatically deleted in 90 days).
Below are some of the most interesting parts - maybe they'll help someone at some point.
Marius Iversen (Elastic Software Engineer)
When you say that you have multiple Logstash ports, are you pointing to the fact that you have a single Logstash with multiple listening ports, or actually multiple Logstash instances running?
alexandrpaliy
I mean a single Logstash with multiple listening ports. I did not try multiple Logstash instances.
Marius Iversen (Elastic Software Engineer)
More ports will just mean more network stack to keep the state for
Marius Iversen (Elastic Software Engineer)
You don't really need multiple ports to handle multiple workers
alexandrpaliy
About multiple ports: I was worried that I could somehow overload a network port (I don't know, some internal Linux limit for TCP ports/sockets), that's why I decided to use multiple ports.
I can guess that multiple ports will increase Logstash RAM usage a bit, but do you really think they can also cause some [significant] performance decrease?
Marius Iversen (Elastic Software Engineer)
That should not really be a problem; in most cases that would rather be on the interface itself instead
Marius Iversen (Elastic Software Engineer)
I wouldn't say for sure, since we have not really configured it that way in any cases before, but if it's not needed, we should also eliminate the possibilities first
Marius Iversen (Elastic Software Engineer)
How many nodes do you have in the Elasticsearch cluster btw?
alexandrpaliy
3. I have 2 disks: SSD and HDD. I use one data_hot node for the SSD and one data_cold node for the HDD, and the 3rd (or, better, "the very first") node is my master node, just to control the other two.
Marius Iversen (Elastic Software Engineer)
Hmm, usually you would want all 3 to be masters and data at the same time, though it's a bit different since you have one hot and one cold
Marius Iversen (Elastic Software Engineer)
That is not going to impact your ingest though
Marius Iversen (Elastic Software Engineer)
Gotcha, so here is the list in my mind, and I don't expect you to do any/all of them, it's totally up to you, and no need to rush it:
1. Set up a new Metricbeat (8.6+, same as your new cluster version) and add K8s monitoring and Elasticsearch monitoring to it; this alone could in many cases point you to the bottleneck. I believe you can simply spin this up in another container.
2. Set up a second Logstash on the same server; since it's just a container it should be pretty straightforward, I presume?
3. Use only 1 port on each Logstash and drop the worker count significantly (feel free to start with the same number as CPU cores).
4. Configure Filebeat with 6 workers and keep loadbalance set to true; also, pipelines is wrong, I think it's called pipelining (https://www.elastic.co/guide/en/beats/filebeat/8.6/logstash-output.html#_pipelining). See the filebeat.yml sketch after this list.
5. Use the filestream input instead of log on Filebeat: (filestream input | Filebeat Reference [7.17] | Elastic)
6. If you can test with an 8.x Filebeat instead, that would be nice; it has plenty of improvements for filestream (and log state in general).
7. Filebeat and Logstash monitoring is also an option if you want (using Metricbeat).
8. Check the metrics Filebeat logs every 30 seconds, especially around the queue count: does it increase?
9. If the queue count increases in #8, check the similar Logstash stats.
10. If the Logstash queue count is also going up, then it's either the ES output on Logstash or ES itself that is the problem.
11. You can use Rally to benchmark your new ES cluster; the "track" that you want is called elastic/logs: https://esrally.readthedocs.io/en/latest/race.html
12. Might be good to confirm that the cold node does not have the ingest role.
13. Could reconsider more, smaller containers, so that you can have 3 hot nodes that also handle ingest and master (but that is just a thought at the moment).
14. If you are using custom logs only, I presume you also only have a custom data stream configured? What is the refresh interval and primary/replica count on it? A slightly longer refresh interval is better for ingest (10 seconds for example, or even 1 minute if that's okay).
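(To make points 4 and 5 concrete, this is roughly how they translate into filebeat.yml; the host, id and paths below are placeholders I made up for illustration, not values from the thread.)

```yaml
# filebeat.yml (sketch) -- points 4 and 5 from the list above
filebeat.inputs:
  - type: filestream                 # instead of the older "log" input
    id: my-custom-logs               # filestream inputs need a unique id
    paths:
      - /var/log/myapp/*.log

output.logstash:
  hosts: ["logstash.example.local:5044"]   # a single port is enough
  loadbalance: true
  workers: 6
  pipelining: 2                      # the option Marius meant instead of "pipelines"
```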
Marius Iversen (Elastic Software Engineer)
(...) Starting with #1 is what I would do at least; you can always delete that data afterwards if you are not interested in it
Marius Iversen (Elastic Software Engineer)
Also, if you are on K8s (I completely forgot), do you use ECK?
alexandrpaliy
Thank you very much, I'll try it. A couple of questions:
0 (general): Initially I thought it was some kind of "plan" of sequential steps. Now that I re-read it, it looks more like just a set of different "measures" to test/compare - is that correct?
1 (and your latest message): No, I don't use k8s at all; I use a pretty simple docker-compose to handle my ELK stack at the moment.
7 (and partially 1): I have tried to use internal monitoring (which is probably considered deprecated by now, but which is kinda simpler/faster to set up) for both Filebeat and Logstash. But I'll try Metricbeat, np.
12: No, the cold node definitely doesn't have the ingest role:
$ cat /opt/docker/docker-compose.yml | grep -E '(^ es-\S+:$)|roles'
es-master:
- node.roles=[master, remote_cluster_client]
es-data1:
- node.roles=[data_content,data_hot,data_warm]
es-data2:
- node.roles=[data_cold,data_frozen]
14: No, I think I don't use any custom data streams (simply because I'm not even sure what that means). I didn't change any "default" Elasticsearch settings, except for setting "number_of_replicas": "0" in the ES index template.
Marius Iversen (Elastic Software Engineer)
0: It is just different measures, correct.
Marius Iversen (Elastic Software Engineer)
7: As long as you use docker-compose, that should be enough for Metricbeat; instead of the K8s module you can use the Docker module. It would be nice to see some container stats from them as well.
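(A minimal sketch of what that could look like in metricbeat.yml; the module names and options are standard, but the hosts are placeholders for my setup.)

```yaml
# metricbeat.yml modules section (sketch) -- Docker module instead of the Kubernetes one,
# plus the Elasticsearch module for stack monitoring
metricbeat.modules:
  - module: docker
    metricsets: ["container", "cpu", "diskio", "memory", "network"]
    hosts: ["unix:///var/run/docker.sock"]
    period: 10s

  - module: elasticsearch
    xpack.enabled: true                    # ship the data in stack-monitoring format
    hosts: ["http://es-data1:9200"]        # placeholder, matches a service name from my compose file
    period: 10s
```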
Marius Iversen (Elastic Software Engineer)
12: Stack monitoring requires ingest pipelines, so at least data1 needs to have the ingest role. There is no "hot" role anymore either, so you can remove that one.
Marius Iversen (Elastic Software Engineer)
Stats that are interesting, especially if you also collect from Logstash/Filebeat:
- Filebeat: queue size; ingest rate + ack rate should be similar, they are shown in the dashboard.
- Logstash: queue size, ingest rate + ack rate as well, if available.
- Elasticsearch: most stats there are going to be useful, and they will show you if it has problems with any of the current container resources.
- Docker stats: I/O stats are the main ones for sure, CPU threads as well if available.
Marius Iversen (Elastic Software Engineer)
There are more as well, but I am writing this off the top of my head; I don't have the stats in front of me.
Marius Iversen (Elastic Software Engineer)
Also, on number 14: check the configured index refresh interval in that index template you modified; it's called refresh_interval: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_unset_or_increase_the_refresh_interval
Marius Iversen (Elastic Software Engineer)
if you have not configured the one above, that alone will give you a big boost
Marius Iversen (Elastic Software Engineer)
Set it to something like 1m or 5m (the setting is the lag from when the data is ingested to when it's searchable), so use a value you are okay with.
Marius Iversen (Elastic Software Engineer)
You can always decrease them later again if it works for you
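(For completeness, refresh_interval is an index-level setting; a sketch of the relevant keys, shown as YAML for readability - in practice they go into the "settings" section of the index template, or are applied to an existing index via the _settings API.)

```yaml
# Index settings sketch (the actual template/_settings body is JSON, keys shown here as YAML)
index:
  refresh_interval: 1m        # lag between ingest and searchability; the default is 1s
  number_of_replicas: 0       # what I already had set in my template
```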
Dain (Elastic Security PME)
Just to pile on - if I understand correctly, the issue happened when configured for filebeat -> network -> logstash (console output)? If so, I would definitely check the network in detail as well - e.g. a tcpdump capture checked with Wireshark - look for packet loss, TCP window reductions, retransmits, etc.
alexandrpaliy
Yes, so far the "slowness" appears when the network is involved. I have tried to test the general "bandwidth" between my Filebeat and Logstash/Elasticsearch nodes with iperf3 - it easily went up to 500 Mbps (while when Filebeat sends data to LS/ES, it hardly reaches 5-10 Mbps). There are also no complex firewalls in front of/behind those servers, so I have no reason to blame the network itself so far.
Marius Iversen (Elastic Software Engineer)
If the issue is the network between Filebeat and Logstash, you should see the queue build up on Filebeat but not on Logstash
Marius Iversen (Elastic Software Engineer)
Similar with Logstash - ES
Marius Iversen (Elastic Software Engineer)
You could start with just looking at the stats in that case first
Marius Iversen (Elastic Software Engineer)
@Dain (Elastic Security PME)
What about also having all 3 nodes as master, two of them for hot + ingest, and the remaining one for warm (no cold/frozen for now)? A bit unsure if that would help
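(If I understood that suggestion right, it would look roughly like this in my compose file; just a sketch, I have not tried it.)

```yaml
# docker-compose.yml (sketch) -- role layout suggested above, only the roles lines shown;
# service names match my existing compose file
services:
  es-master:
    environment:
      - node.roles=[master,data_hot,data_content,ingest,remote_cluster_client]
  es-data1:
    environment:
      - node.roles=[master,data_hot,data_content,ingest]
  es-data2:
    environment:
      - node.roles=[master,data_warm]
```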
Marius Iversen (Elastic Software Engineer)
Cold/frozen roles are really only useful if you have object storage (S3, GCS, etc., or MinIO on-prem).
alexandrpaliy
Hmm, I was definitely checking all the metrics available (when I turned on the legacy "internal" monitoring for Filebeat and Logstash), but I don't really remember anything about queues there (and, as I understand, that's the very first thing I should check in my situation). Maybe I didn't enable something, or maybe that's exactly why internal monitoring is considered deprecated - I'll check what Metricbeat will show me, thank you.
Marius Iversen (Elastic Software Engineer)
You could always start with the Metricbeat on the server running Docker first if you want, as it's the easiest one to configure with docker-compose
Marius Iversen (Elastic Software Engineer)
At least to get the docker + ES stats