How to determine the bottleneck between Filebeat and ES?


I'm trying to determine the bottleneck for my Netflow setup, to see if I can further optimize the performance.

I am ingesting Netflow traffic into a Linux server running both filebeat and elasticsearch 7.1.4. I'm using the Netflow module that came with filebeat to write to ES.

I have done some performance tuning and increased the indexing rate from ~1.5K/s with default values to ~13.5K/s now. But my Netflow ingest rate is still much higher, and I hope to increase the indexing rate further. The problem is, I don't know whether the bottleneck is in filebeat or ES.

I have configured the following settings to increase the indexing rate:

In filebeat.yml:

- output.elasticsearch.bulk_max_size: 4000
- output.elasticsearch.worker: 8
- 64000
- queue.mem.flush.min_events: 4000

In netflow.yml:
- queue_size: 64000
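For reference, the settings above would sit in the config files roughly like this. Note this is a sketch: the unnamed `64000` entry in the post had no key, and `queue.mem.events` is only a guess (it is the usual companion of `flush.min_events` and matches the netflow `queue_size`):

```yaml
# filebeat.yml -- sketch only
output.elasticsearch:
  bulk_max_size: 4000
  worker: 8

queue.mem:
  events: 64000            # assumed key for the bare "64000" entry above
  flush.min_events: 4000

# modules.d/netflow.yml -- under the module's input settings
#   queue_size: 64000
```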

My server has the following specs:

  • 125GB RAM (64GB ringfenced for ES's JVM heap)
  • 48 CPUs
  • 9.6TB HDD total

Current utilization rate (obtained from the Stack Monitoring dashboard for ES node):

  • CPU: 15-25%
  • Memory usage (JVM Heap): Fluctuates between 6GB and 45GB (10-70%)
  • I/O operations rate: typically around 120/s, can occasionally spike to 200/s
  • System load: 10-18
  • Disk available: 1.1TB/7.3TB (I set an ILM policy so that the disk available doesn't fall below 15%)

I currently have 196 indices/primary shards and 0 replica shards, containing 8.4B documents. Each index rolls over at ~50GB.

I think my filebeat is dropping packets, because I always see the following in the filebeat output (I'm running it as filebeat -e):

I also don't know if this helps (from running GET /_nodes/stats):

gc.collectors.young.collection_count: 132997
gc.collectors.young.collection_time_in_millis: 3530938
gc.collectors.old.collection_count: 0
gc.collectors.old.collection_time_in_millis: 0
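As a back-of-envelope check (not from the thread itself), those young-GC counters can be reduced to an average pause time:

```python
# Average young-generation GC pause, computed from the _nodes/stats
# counters quoted above.
collection_count = 132_997
collection_time_ms = 3_530_938

avg_pause_ms = collection_time_ms / collection_count
print(f"average young GC pause: {avg_pause_ms:.1f} ms")  # ~26.5 ms
```

With the old-generation collection count at zero (no full GCs at all), the heap itself does not look like it is under pressure.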

What can I do to determine where the bottleneck is?

Thank you.

For high ingest throughput, SSDs are recommended, so storage performance could be the bottleneck. Run iostat -x and check await and disk utilisation to see whether this is the case.
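A minimal sketch of that check, assuming a Linux host with the sysstat package installed (exact column names vary slightly between sysstat versions):

```shell
# Extended per-device statistics, two 1-second reports (the first report
# is a since-boot average, so read the second one).  For HDDs, sustained
# r_await/w_await well above ~10 ms, or %util pinned near 100%, points
# at storage as the bottleneck.
command -v iostat >/dev/null && iostat -x 1 2 || echo "iostat not found: install sysstat"
```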

How many primary shards are you actively indexing into?

Usually only one at a time, or two when it is about to roll over to another index.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.