Performance hit when multiple filebeats are sending to same ES

Hi,

I have a total of 5 servers, all sending Netflow data using filebeat to the same server (1 of the 5 servers) running ES. Each server runs 2 instances of filebeat, so in total I have 5 x 2 = 10 filebeat instances.

  • When I only have 1 filebeat instance running, my index rate can go up to about 24K/s
  • With 2 filebeat instances running (whether both on the same server or 1 instance each from 2 servers), the total index rate can go up to about 37K/s
  • With 3 filebeat instances, the highest index rate I've seen is about 42K/s, but usually about 30+K/s

So it seems more filebeat instances don't result in a linear increase in Index rate.

I also have similar observations from the Stack Monitoring -> Beats -> Instances page.

  • When 1 filebeat instance is running, the Event Rate is ~24K/s
  • When 3 filebeat instances are running, the Event Rate for all 3 filebeat instances drops to 11-12K/s.
  • When 5 filebeat instances are running, the Event Rate for all 5 filebeat instances drops to ~6-8K/s, meaning my total Index Rate is about 33K/s.

I have the following settings in my filebeat.yml.

queue.mem.events: 64000             # max events the in-memory queue can buffer
queue.mem.flush.min_events: 4000    # minimum batch size the queue forwards to the output

output.elasticsearch.bulk_max_size: 4000   # max events per bulk request
output.elasticsearch.worker: 8             # workers per configured Elasticsearch host

A few questions:

  1. I'm trying to understand why the Event Rate seems to be "equally divided" among my filebeat instances, and if there is any way to improve the performance.
  2. When I have 2 instances of filebeat running on the same server, does the queue.mem settings apply to each filebeat instance individually, or do both instances "share" the queue?
  3. From the documentation, it says that output.elasticsearch.worker is "the number of workers per configured host publishing events to Elasticsearch." So when I have 3 filebeat instances running (say, 2 on the same server + 1 from another server), are 3 x 8 workers started on my ES, or 2 x 8 workers? I have 96 CPUs on my server; how many workers is appropriate?
  4. I noticed that on the Stack Monitoring -> Beats -> Instances page, when I have 2 filebeat instances running on the same server, only one of the 2 filebeat instances is shown, and it seems to randomly alternate between the 2. Why is this so?

Thank you.

I looked at the filebeat logs, and found that periodically, for a minute, no packets were dropped. Usually, I would see

input.netflow.packets.dropped: 20000
input.netflow.packets.received: 8000

But there are times when there are no dropped packets and the number of received packets is much higher. This usually occurs when memstats.gc_next and memstats.memory_alloc are at their lowest compared to a few minutes before and after.

Is my performance affected by memory? I don't quite understand the explanation of beat.memstats.gc_next in this post, but my memstats.memory_alloc is always less than memstats.gc_next.

I have set the JVM heap size of my ES to 64GB, where the server has 755GB of RAM:

-Xms64g
-Xmx64g

Should I decrease this to 32GB?

Based on your observations, it seems like your general cumulative/total ingest rate is around 30-40k events/second. This sounds more like a limitation on what Elasticsearch is able to process rather than a limitation on the Filebeat end.

I have set the JVM heap size of my ES to 64GB, where the server has 755GB of RAM

64GB is too high; you should be at around a 26-30GB heap to allow for compressed object pointers (oops), for ideal performance.

See: Advanced configuration | Elasticsearch Guide [8.8] | Elastic

Set Xms and Xmx to no more than the threshold for compressed ordinary object pointers (oops). The exact threshold varies but 26GB is safe on most systems and can be as large as 30GB on some systems.
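For reference, a minimal override sketch (the exact path depends on how Elasticsearch is installed; on package installs it is typically /etc/elasticsearch/jvm.options.d/, and the filename here is just an example):

# /etc/elasticsearch/jvm.options.d/heap.options
-Xms30g
-Xmx30g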

Which version of Elasticsearch and Filebeat are you using?

With the amount of resources you have on that system, there is probably far more RAM than can be used by a single Elasticsearch instance. Depending on the storage available, you might want to run multiple Elasticsearch instances on that server and then give each one dedicated storage. Then increase the shard count of the backing index to spread the load.

(In my opinion, 30-40k events/second is pretty OK for a single instance of Elasticsearch to be processing).

I'm using version 8.3.3 for both filebeat and ES.

I'll reduce the heap size and see how it goes.

Unfortunately, there isn't a lot of disk space on this server, so I'm not sure how running multiple ES instances would work.

Hmm, ok, I'd say also look at your disk stats, mainly around the IOWait time to see if you're being bottlenecked by your disks.

There have been various improvements since 8.3.3, so you might want to test upgrading to something newer (8.8.1 was just released this morning), but I don't think you'll see any significant improvement (maybe just a few more % increase, if that).

Most of the time the bottleneck is on the receiving end, in this case Elasticsearch, and the main cause of bottlenecks in Elasticsearch is the disk.

What disk do you have? Is it SSD or HDD?

Are you sending the monitoring data to the same cluster? This can heavily impact performance.

Should I use iotop / iostat, or some other tools for this?

I'm using SATA SSDs.

Yes, everything is going to the same ES. I'll try turning off Metricbeat to see if there's any improvement.

Should I use iotop / iostat, or some other tools for this?

Either should work, you just want to see if your server is spending a lot of time waiting on the disk(s) to do work.
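For example (assuming the sysstat package is installed), this prints extended device statistics every 5 seconds; the main columns to watch are %iowait, await and %util:

iostat -x 5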


I've read that ES scales better horizontally than vertically. Would a better approach be running multiple ES instances on different physical servers, rather than multiple ES instances on the same server? I have a total of 5 servers, so I'm thinking of maybe running an ES node on each of them, forming a cluster. I would have more hard disk space then as well.

An ES node is assigned all roles (master, data, etc.) by default, right? So I can have 5 master/data ES nodes?

I have a total of 5 servers, so I'm thinking of maybe running an ES on each of them, forming a cluster. I would have more hard disk space then as well.

Yes, in theory you should see better performance the more servers you add. (You can have multiple Elasticsearch instances on the same server; you'd just want to make sure the disks aren't shared. But if you have more disks across more servers, that would work as well.)

One thing to note: you will want to adjust index.number_of_shards on the index(es) you're using. By default, Elasticsearch sets this value to 1, which doesn't really "scale". You can generally set the value to the number of nodes you have in your cluster. (You can sometimes increase this number beyond that to push things even further if you're not seeing resource saturation, though it generally has diminishing returns.)

When I set this value to say, 5, in each of the 5 elasticsearch.yml files in the 5 servers, does it mean there's 5 shards per index, or 5*5 shards per index?

You'll want to set this on the index (or, generally, the index template) itself; it isn't something that goes in elasticsearch.yml. See the sketch below.
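A rough sketch (the template name and index pattern here are placeholders; make sure the pattern matches the indices Filebeat actually writes to, and that the priority is higher than any existing template for that pattern):

PUT _index_template/netflow-shards
{
  "index_patterns": ["filebeat-*"],
  "priority": 300,
  "template": {
    "settings": {
      "index.number_of_shards": 5
    }
  }
}

Note that number_of_shards only applies to indices created after the template is in place (e.g. at the next rollover); existing indices keep their shard count.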

These are the things I've changed, but there hasn't been any significant improvement. I'm now running a single filebeat instance.

  1. Changed JVM heap size to 30GB
  2. Added another node to the ES cluster (2 physical servers), making it a 2-node cluster (total JVM heap is now 60GB)
  3. 2 shards per index
  4. Turned off metricbeat
  5. Changed index rollover size in ILM from 50GB to 40GB

Changing output.elasticsearch.worker in filebeat.yml from 8 to 12 or 16 did not improve the performance, so I left it at 8.

Some small improvements I've observed since the changes.

  1. Usually, the number of received packets in 30 seconds is about 8K (with about 22K dropped). Now, it's usually about 10K received packets and 20K dropped packets.
  2. When the index rollover size was 50GB, the Indexing Rate chart in Kibana for the current ES index fluctuated between 20 and 30K/s. After changing the size to 40GB, it fluctuates less, between 26 and 29K/s.

Not sure if this matters, but I also noticed that when the index rollover size was 50GB, the index's merges.total_throttled_time_in_millis can be as high as 50% of merges.total_time_in_millis. I got this by looking at the Stats of the index under Index Management. For example, I would see (for past indices)

"merges": {
    "current": 0,
    "current_docs": 0,
    "current_size_in_bytes": 0,
    "total": 178,
    "total_time_in_millis": 5675026,
    "total_docs": 62632665,
    "total_size_in_bytes": 65673150397,
    "total_stopped_time_in_millis": 13460,
    "total_throttled_time_in_millis": 2712266,
    "total_auto_throttle_in_bytes": 67806790
}

After changing the rollover size to 40GB, this has decreased to about 25-30%, e.g.

"merges": {
    "current": 0,
    "current_docs": 0,
    "current_size_in_bytes": 0,
    "total": 1639,
    "total_time_in_millis": 10781764,
    "total_docs": 158044730,
    "total_size_in_bytes": 171873293917,
    "total_stopped_time_in_millis": 10079,
    "total_throttled_time_in_millis": 3052129,
    "total_auto_throttle_in_bytes": 132788866
}

The number of merges seems to have increased, though.

Something else I observed:

As I mentioned previously, about every 3 minutes, I get zero packet drop and all packets are received for about 2 min, then the packets start dropping again. This usually occurs when memstats.gc_next and memstats.memory_alloc are the lowest compared to a few minutes before and after.

The indexing rate drops to 0 at around the same time the packet drop decreases to 0 (about 30K packets received at this time). Indexing rate stays at 0 for about 20-40 seconds before increasing, while packet drop remains at 0 for about 2 more min.

What else can I try?

The indexing rate drops to 0 at around the same time the packet drop decreases to 0 (about 30K packets received at this time). Indexing rate stays at 0 for about 20-40 seconds before increasing, while packet drop remains at 0 for about 2 more min.

Sounds like potentially garbage collection doing something here. How many CPU cores does Elasticsearch have access to?

Were you ever able to check if your servers have a lot of wait time on IO (the SATA SSDs)?

What happens if you increase shards per index from 2 to 4? You can have multiple shards from the same index on the same node and get some additional performance (if you are disk IO limited).

The merge times, I think, might be misleading: you might have gone from infrequent long merges to more frequent shorter merges, but (I didn't do any real math here) the average merge time is probably similar overall. Long merge times can also be an indicator of disk IO limits, since (to my understanding) merges rely heavily on reads/writes.

Yeah, it could be GC. The cycle at which this packet drop/zero indexing rate occurs is quite regular.

How do I check how many CPU cores ES has access to? I did not do any specific configurations, so I assume it can use all 96? I can see many, if not all, of the CPUs in use when running top. The average CPU usage is roughly 20-25%.

When I ran iostat, the %iowait is 0.03, and kB_wrtn/s is quite stable at 137K.

There is some improvement when I increased the number of shards from 2 to 4.

  • When running 1 filebeat instance, the indexing rate went from 24K/s to 28K/s.
  • For 2 filebeat instances, the total indexing rate went from 34K/s to 42K/s.
  • But when the number of shards is kept constant (be it 2 or 4), increasing the number of filebeat instances reduces the indexing rate of a single instance. For example, at 4 shards, when there was only 1 filebeat, the indexing rate was 28K/s, but when there were 2 filebeats, the indexing rate of each filebeat dropped to 21K/s.

I've read that the recommended shard size is 10-50GB. I've kept my index size at 40GB, so I can have at most 4 shards (I did try with 8 shards, and only got a 1% improvement in performance). Only if I increase my index size to, say, 100GB could I have up to 10 shards. Should I just increase my index size so I can parallelize more with more shards (while trying not to have so many that search performance degrades)?

I'm trying to understand the benefits of having multiple nodes in a cluster. It seems like the number of shards is the main limiting factor for indexing performance. How does having more nodes help, when I can only write to 1 index at a time, and can't increase the number of shards any further (since it depends on index size), whether I have one node or 100 nodes? Or are the shards spread across different nodes so that writes don't all go to one node at a time?

How do I check how many CPU cores ES has access to?

If you aren't using containers and Elasticsearch is just running on the server as a regular process, it should have access to all of the CPU cores.
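If you want to confirm what Elasticsearch itself sees, something like this should show the processor counts per node:

GET /_nodes/os?filter_path=nodes.*.name,nodes.*.os.available_processors,nodes.*.os.allocated_processors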

When I ran iostat, the %iowait is 0.03, and kB_wrtn/s is quite stable at 137K.

Hmm, 137MBps could potentially be hitting a write limit of your SATA SSDs depending on the model they are. Would you be able to provide the average IOPS on the SSDs?

For example, at 4 shards, when there was only 1 filebeat, the indexing rate was 28K/s, but when there were 2 filebeats, the indexing rate of each filebeat dropped to 21K/s.

This is somewhat to be expected: you went from a total of 28K/s to 42K/s, meaning you're seeing an overall throughput increase. When you see the single 28K/s Filebeat, does it have a backlog/dropped events, or is it able to send/process all the data it receives?

Should I just increase my index size so I can parallelize more with more shards (yet at the same time, try not to have too many that degrades search performance)?

You should try to "size" your index based on shard size rather than index size. If you are using Elasticsearch's ILM policies, you should be able to use the max_primary_shard_size setting to achieve this.

I'm trying to understand the benefits of having multiple nodes in a cluster.

There are 2 main reasons to have multiple nodes in a cluster:

  1. High Availability - Depending on your use case, this may or may not matter
  2. Greater ability to distribute load
    • Given this thread, this is probably what you're more interested in.
    • If you have a single node, you are limited to the resources that node has; at some point you will hit a bottleneck (CPU, memory, disk). Given the specs of your systems, I suspect you will hit the disk bottleneck first.
    • By having multiple servers (and multiple shards in your index), you can more easily spread the resource load improving the general overall performance (or in your case, event throughput).
    • Note: Unless you are using a coordinating node, I'd make sure you set up the Filebeat Elasticsearch output with all of your Elasticsearch nodes listed, so that Filebeat can use round-robin load distribution across them (see the sketch after this list).
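A minimal sketch of that output section (hostnames and ports are placeholders for your actual nodes):

output.elasticsearch:
  # bulk requests are spread across all listed hosts
  hosts: ["es-node-1:9200", "es-node-2:9200", "es-node-3:9200"]
  worker: 8
  bulk_max_size: 4000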

I ran iostat -xd 3 /dev/sda over 25 minutes to get a more accurate reading. After averaging the numbers, I got the following.

Average w/s: 313.93
Average IOPS (r/s + w/s): 314.78
Max w/s: 1098

Average wkB/s: 68370.74
Max wkB/s: 355846.67

Average w_await: 0.691
Average %util: 3.38

There is always packet drop. In this case, the packet drop for the single filebeat at 28K/s was 33%, while the packet drop for the two filebeats at 42K/s total was higher, at 43%.

Thanks for this. I just realized that my Filebeat was only pointing to one ES. I'll add the other ES and see if there's any improvement.

I've added the other ES nodes to the filebeat config file, but there doesn't seem to be much effect on performance.

Hmm, that is really weird.

Would you be able to do the following?

While you're ingesting data from Filebeat, can you run the command GET /_cat/thread_pool against the cluster? It should show what thread pools are actively being used and if there are any bottlenecks.
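If it helps, you can also request specific columns and sorting, e.g.:

GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=node_name

The main things to look for are a sustained queue and any rejected counts on the write pool.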

Also, would you be able to provide the output of GET /_cluster/stats?

I'm now running 3 ES nodes in the cluster, with 1 filebeat instance. The highest indexing rate I've seen is 28.7K/s, but on average it's above 25K/s. The packet drop is about 32%.

I ran this a few times, and the non-zero threads are

node-3 management           1 0 0
node-3 write                3 0 0
node-4 management           1 0 0
node-4 search_coordination  1 0 0
node-4 refresh              1 0 0
node-4 write                4 0 0
node-5 management           1 0 0
node-5 refresh              1 0 0
node-5 search_coordination  1 0 0
node-5 write                5 0 0

Only the write and management pools are non-zero all the time; the others are sometimes all zeros. Also, the write values range between 1 0 0 and 5 0 0 (as seen thus far), and the highest value alternates among the 3 nodes, i.e. no one node is consistently the highest.

I can't copy the entire output here. Are there any specific values you're looking for? I've copied the ones I thought were relevant here.

"indices": {
  "count": 149,
  "shards": {
    "total": 865,
	"primaries": 821,
	"replication": 0.05359,
	"index": {
	  "shards": {
	    "min": 1,
		"max": 8,
		"avg": 5.805
	  },
	  "primiaries": {
	    "min": 1,
		"max": 8,
		"avg": 5.510
	  },
	  "replication": {
	    "min": 0,
		"max": 1,
		"avg": 0.295
	  }
	}
  },
  "docs": {
	"count": 5614635389,
	"deleted": 2961229
  },
  "store": {
	"size_in_bytes": 4458135794391,
	"total_data_set_size_in_bytes": 4458135794391,
	"reserved_in_bytes": 0
  },
  "fielddata": {
	"memory_size_in_bytes": 76208,
	"evictions": 0
  },
  "query_cache": {
	"memory_size_in_bytes": 39251123,
	"total_count": 36180242,
	"hit_count": 7489377,
	"miss_count": 28690865,
	"cache_size": 59627,
	"cache_count": 338184,
	"evictions": 278557
  },
  "completion": {
	"size_in_bytes": 0
  },
  "segments": {
	"count": 19647,
	"memory_in_bytes": 0,
	"terms_memory_in_bytes": 0,
	"stored_fields_memory_in_bytes": 0,
	"term_vectors_memory_in_bytes": 0,
	"norms_memory_in_bytes": 0,
	"points_memory_in_bytes": 0,
	"doc_values_memory_in_bytes": 0,
	"index_writer_memory_in_bytes": 27677624,
	"version_map_memory_in_bytes": 0,
	"fixed_bit_set_memory_in_bytes": 14957976,
	"max_unsafe_auto_id_timestamp": 1687136221116,
	"file_sizes": {}
  },
  ...,
  "nodes": {
	"os": {
	  "available_processors": 288,
	  "allocated_processors": 288,
	  ...,
	  "mem": {
	    "total_in_bytes": 2432527048704,
		"adjusted_total_in_bytes": 2432527048704,
		"free_in_bytes": 24682225664,
		"used_in_bytes": 2407844823040,
		"free_percent": 1,
		"used_percent": 99
	  }
	},
	"process": {
	  "cpu": {
		"percent": 9
	  },
	  "open_file_descriptors": {
		"min": 3141,
		"max": 3401,
		"avg": 3265
	  }
	},
	"jvm": {
	  ...,
	  "mem": {
		"heap_used_in_bytes": 33092436752,
		"heap_max_in_bytes": 96636764160
	  },
	  "threads": 1367
	},
	"fs": {
	  "total_in_bytes": 19993490964480,
	  "free_in_bytes": 15529135980544,
	  "available_in_bytes": 14521446711296
	},
	"ingest": {
	  "number_of_pipelines": 3,
	  "processor_stats": {
	    "conditional": {
		  "count": 22525863239,
		  "failed": 0,
		  "current": 2,
		  "time_in_millis": 201656155
		},
		"geoip": {
		  "count": 15017242898,
		  "failed": 0,
		  "current": 0,
		  "time_in_millis": 64156818
		},
		...,
		"rename": {
		  "count": 30034485796,
		  "failed": 0,
		  "current": 0,
		  "time_in_millis": 23136001
		},
		...,
		"set": {
		  "count": 7508621451,
		  "failed": 0,
		  "current": 0,
		  "time_in_millis": 31491921
		}
	  }
	},
	"indexing_pressure": {
	  "memory": {
		"current": {
		  "combined_coordinating_and_primary_in_bytes": 0,
		  "coordinating_in_bytes": 0,
		  "primary_in_bytes": 0,
		  "replica_in_bytes": 0,
		  "all_in_bytes": 0
		},
		"total": {
		  "combined_coordinating_and_primary_in_bytes": 0,
		  "coordinating_in_bytes": 0,
		  "primary_in_bytes": 0,
		  "replica_in_bytes": 0,
		  "all_in_bytes": 0,
		  "coordinating_rejections": 0,
		  "primary_rejections": 0,
		  "repica_rejections": 0
		},
		"limit_in_bytes": 0
	  }
	}
  }
}