Hello, I have migrated my ELK cluster to new hardware and also upgraded ELK to 5.5.2.
The cluster consists of:
- 1 master node with 2 CPU/5G RAM
- 12 data nodes with 6 CPU/12G RAM
- 3 ingest nodes with 6 CPU/6G RAM
- 3 logstash nodes with 6 CPU/6G RAM
This cluster should handle about 20k UDP syslog EPS (many different syslog streams from different sources). Currently it is able to process only about 1k EPS; the rest of the UDP packets are dropped (about 10k dropped packets per second on the Logstash server's eth interface).
I have increased the kernel network buffer limits and raised the buffer size and worker count in the Logstash UDP input config, as advised in other threads on performance, but it does not seem to help at all: Logstash is still able to process and load only a small number of messages per second (about 1000 at most).
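For completeness, this is a minimal sketch of how the UDP drop counters can be watched on the Logstash host (it polls the standard UDP fields from /proc/net/snmp once a second and prints the per-second delta; nothing specific to my setup):

```python
#!/usr/bin/env python
# Minimal sketch: poll the kernel's UDP error counters from /proc/net/snmp
# and print how fast they grow, to see the drop rate on the Logstash host.
import time

def read_udp_counters():
    with open('/proc/net/snmp') as f:
        lines = [l.split() for l in f if l.startswith('Udp:')]
    # first Udp: line is the header, second one holds the values
    header, values = lines[0][1:], [int(v) for v in lines[1][1:]]
    return dict(zip(header, values))

prev = read_udp_counters()
while True:
    time.sleep(1)
    cur = read_udp_counters()
    print('InErrors/s: %d  RcvbufErrors/s: %d' % (
        cur['InErrors'] - prev['InErrors'],
        cur.get('RcvbufErrors', 0) - prev.get('RcvbufErrors', 0)))
    prev = cur
```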
I have also written a Python script that processes the incoming syslog stream instead of Logstash, and it looks like the bulk loading of events into the ES ingest nodes is what takes a long time. Here is the output of the script, which indexes every 1000 messages in a single bulk request; at first it seems fast, but it quickly slows down to a crawl:
[root@elk-new-logstash1 ~]# ./elasticsearch_connector.py
bulk update took: 0:00:00.985152
bulk update took: 0:00:01.850553
bulk update took: 0:00:02.330301
bulk update took: 0:00:03.967893
bulk update took: 0:00:05.164435
bulk update took: 0:00:06.103189
bulk update took: 0:00:07.932095
bulk update took: 0:00:08.482608
bulk update took: 0:00:07.087193
bulk update took: 0:00:07.101610
bulk update took: 0:00:07.688177
bulk update took: 0:00:10.526025
bulk update took: 0:00:13.876477
bulk update took: 0:00:10.901226
bulk update took: 0:00:17.698544
bulk update took: 0:00:12.845201
bulk update took: 0:00:17.934822
bulk update took: 0:00:13.608020
bulk update took: 0:00:14.494025
bulk update took: 0:00:15.938467
bulk update took: 0:00:18.055225
bulk update took: 0:00:21.408280
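The script is essentially doing something like this (heavily simplified sketch; the real script does proper syslog parsing, and the host names, port and index name below are placeholders for my actual values):

```python
#!/usr/bin/env python
# Simplified sketch of elasticsearch_connector.py: receive syslog over UDP
# and bulk-index every 1000 messages into the ingest nodes, timing each bulk.
import socket
from datetime import datetime
from elasticsearch import Elasticsearch, helpers

# placeholder host names for the three ingest nodes
es = Elasticsearch(['elk-ingest1:9200', 'elk-ingest2:9200', 'elk-ingest3:9200'])

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 514))

batch = []
while True:
    data, addr = sock.recvfrom(65535)
    batch.append({
        '_index': 'syslog-test',   # placeholder index name
        '_type': 'syslog',
        '_source': {'message': data.decode('utf-8', 'replace'), 'host': addr[0]},
    })
    if len(batch) >= 1000:
        start = datetime.now()
        helpers.bulk(es, batch)
        print('bulk update took: %s' % (datetime.now() - start))
        batch = []
```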
When I start a Logstash instance, the CPU usage of this particular server follows a similar pattern: at first Logstash uses about 400% CPU, so it seems to be processing a lot of data, then usage drops to 100-130% (waiting for the ingest nodes to accept the data?).
The data nodes and ingest nodes do not seem to be doing much (CPU low, heap/overall memory usage rather low):
iostat from one of the data nodes also suggests the disks are not heavily utilized:
So I'm not sure where to look for the culprit. At first I was convinced that Logstash was performing too slowly, but after seeing the output of my test Logstash replacement (the Python script mentioned above), I'm leaning more towards Elasticsearch. Still, I have no clue what else to look at or how to fix this... Also, I have set the number of replicas on the most active index to 0, and nothing changed performance-wise.
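For reference, the replica change was made through the index settings API, roughly like this (index name is a placeholder for my most active index):

```python
# Drop the replica count of the busiest index to 0 via the settings API.
from elasticsearch import Elasticsearch

es = Elasticsearch(['elk-ingest1:9200'])   # placeholder host name
es.indices.put_settings(
    index='syslog-2017.09.12',             # placeholder index name
    body={'index': {'number_of_replicas': 0}})
```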