Strange, very slow performance despite lots of CPU/RAM resources

Hello, I have migrated an ELK cluster to new hardware and also upgraded ELK to 5.5.2.

Cluster consists of:

  • 1 master node with 2 CPU/5G RAM
  • 12 data nodes with 6 CPU/12G RAM
  • 3 ingest nodes with 6 CPU/6G RAM
  • 3 logstash nodes with 6 CPU/6G RAM

I need to process about 20k UDP syslog EPS (events per second) on this cluster (many different syslog streams from different sources). Currently it is able to process only about 1k EPS; the rest of the UDP packets are dropped (10k dropped packets per second on the Logstash server's eth interface).

I have increased kernel memory for network buffers, and increased the buffers and worker count in the Logstash UDP receiver config, as advised in other threads regarding performance, but it does not seem to help at all - Logstash is able to process and load only a small number of messages per second (about 1000 at most).
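To check whether the buffer changes have any effect, it can help to watch the kernel's UDP counters while traffic is flowing. Below is a minimal Python sketch that parses the `Udp:` section of `/proc/net/snmp` on Linux and computes how fast `RcvbufErrors` is growing. It assumes the drops show up as socket receive-buffer errors; drops counted on the eth interface itself (as in the figure above) are a separate counter and would not appear here. The function names are made up for illustration.

```python
import time


def parse_udp_counters(snmp_text):
    """Return the Udp: counters from /proc/net/snmp-style text as a dict."""
    header = None
    for line in snmp_text.splitlines():
        if not line.startswith("Udp:"):
            continue
        fields = line.split()[1:]
        if header is None:
            # The first Udp: line holds the counter names.
            header = fields
        else:
            # The second Udp: line holds the matching values.
            return dict(zip(header, (int(v) for v in fields)))
    raise ValueError("no Udp: section found")


def rcvbuf_drops_per_second(before, after, seconds):
    """Datagrams dropped per second because the socket receive buffer was full."""
    return (after["RcvbufErrors"] - before["RcvbufErrors"]) / seconds


def sample_drop_rate(interval=10, path="/proc/net/snmp"):
    """Read the counters twice, `interval` seconds apart (Linux only)."""
    def read():
        with open(path) as f:
            return parse_udp_counters(f.read())

    before = read()
    time.sleep(interval)
    return rcvbuf_drops_per_second(before, read(), interval)
```

Calling `sample_drop_rate(10)` on the Logstash server before and after a tuning change gives a rough number to compare. If it stays near zero while the interface still reports drops, the packets are probably being lost before they reach the socket buffer.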

I have written a Python script that processes the input syslog stream instead of Logstash, and I noticed that it takes a long time to bulk load events into the ingest ES nodes. Here's the output of my script, which loads events in bulks of 1000 messages - at first it seems to work fast, but it quickly slows down to a crawl:

[root@elk-new-logstash1 ~]# ./
bulk update took: 0:00:00.985152
bulk update took: 0:00:01.850553
bulk update took: 0:00:02.330301
bulk update took: 0:00:03.967893
bulk update took: 0:00:05.164435
bulk update took: 0:00:06.103189
bulk update took: 0:00:07.932095
bulk update took: 0:00:08.482608
bulk update took: 0:00:07.087193
bulk update took: 0:00:07.101610
bulk update took: 0:00:07.688177
bulk update took: 0:00:10.526025
bulk update took: 0:00:13.876477
bulk update took: 0:00:10.901226
bulk update took: 0:00:17.698544
bulk update took: 0:00:12.845201
bulk update took: 0:00:17.934822
bulk update took: 0:00:13.608020
bulk update took: 0:00:14.494025
bulk update took: 0:00:15.938467
bulk update took: 0:00:18.055225
bulk update took: 0:00:21.408280
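For scale: with 1000 messages per bulk, the first request (0.985 s) works out to roughly 1015 events/s, and the last one (21.4 s) to roughly 47 events/s, against a target of 20k EPS. The script itself is not shown here, so the following is only a rough sketch of how such a measurement could be structured, not the actual script. It builds a newline-delimited `_bulk` body and times a caller-supplied `send` function; `send`, the index name `syslog-test` and the type name `syslog` are placeholders.

```python
import json
import time


def build_bulk_body(events, index, doc_type):
    """Build a newline-delimited body for the Elasticsearch _bulk API."""
    lines = []
    for event in events:
        # Action line, followed by the document source line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(event))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def timed_bulk(send, events, index="syslog-test", doc_type="syslog"):
    """Send one bulk via `send(body)`; return (elapsed seconds, events/sec)."""
    body = build_bulk_body(events, index, doc_type)
    start = time.monotonic()
    send(body)  # placeholder: e.g. an HTTP POST of `body` to an ingest node
    elapsed = time.monotonic() - start
    rate = len(events) / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate
```

Logging the events/sec figure for each bulk, rather than only the elapsed time, makes it easier to compare runs with different bulk sizes.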

When I start a Logstash instance, the CPU usage of that particular server follows a similar pattern - at first Logstash takes about 400% CPU, so it seems to be processing a lot of data, then usage drops to 100 - 130% (waiting for ingest nodes to accept the data?).

Data nodes and ingest nodes don't seem to be doing much (CPU low, heap/overall memory usage rather low).

iostat from one of the data nodes also suggests that it's not utilizing I/O much.

So I'm not sure where to look for the culprit. At first I was convinced that Logstash was too slow, but after seeing the output of my test Logstash replacement (the Python script I mentioned), I'm looking more in the direction of Elasticsearch. Still, I have no clue what else to look for or how to fix this... Also - I have set the number of replicas on the most active index to 0, and nothing changed performance-wise.

FYI we’ve renamed ELK to the Elastic Stack, otherwise Beats feels left out :wink:

Can you elaborate more on what the hardware changes were, and what version you were on previously?

Just one master?

I have to check which Elastic version I had on the previous setup. Anyway, the "new" hardware is a bunch of IBM x3650 servers with the Xen 7.2 hypervisor, running CentOS 7 VMs on which the Elastic Stack operates. The "new" setup is not really new, as these servers are recycled from another application and are quite old. Still, the cluster should perform a lot better than it does, as I used even older and slower hardware (Sun Blade X6250 servers, also with Xen 7) in the previous setup without such problems. Currently I suspect there is a hardware problem with the disk array in one of the x3650 Xen hosts - I need to troubleshoot this before getting back to testing Elastic Stack performance. And regarding the single master node - you're right, I should add redundant master nodes on each Xen host, and I'll probably do this once I've resolved the performance problems.
