I'm trying to determine the bottleneck for my Netflow setup, to see if I can further optimize the performance.
I am ingesting Netflow traffic into a Linux server running both Filebeat and Elasticsearch 7.1.4. I'm using the Netflow module that ships with Filebeat to write to ES.
I have done some performance tuning and increased the indexing rate from ~1.5K/s with the default settings to ~13.5K/s now. But my Netflow ingest rate is still much higher, and I hope to push the indexing rate further. The problem is, I don't know whether the bottleneck is Filebeat or ES.
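For reference, the module is enabled in modules.d/netflow.yml along these lines (the listen host and port below are placeholders, not my exact values):

```yaml
# modules.d/netflow.yml (sketch; host/port are placeholders)
- module: netflow
  log:
    enabled: true
    var:
      netflow_host: 0.0.0.0   # address Filebeat listens on (placeholder)
      netflow_port: 2055      # UDP port the exporters send to (placeholder)
```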
I have configured the following settings to increase the indexing rate:
- output.elasticsearch.bulk_max_size: 4000
- output.elasticsearch.worker: 8
- queue.mem.events: 64000
- queue.mem.flush.min_events: 4000
- queue_size: 64000
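In filebeat.yml that roughly translates to the layout below; if I understand the docs correctly, queue_size is a Netflow module variable, so it goes under var in the module file shown above rather than here:

```yaml
# filebeat.yml (sketch of the tuning settings listed above)
output.elasticsearch:
  bulk_max_size: 4000      # events per bulk request
  worker: 8                # concurrent bulk workers

queue.mem:
  events: 64000            # internal queue capacity
  flush.min_events: 4000   # minimum batch size before flushing to the output
```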
My server has the following specs:
- 125GB RAM (64GB ringfenced for ES's JVM heap)
- 48 CPUs
- 9.6TB HDD total
Current utilization rate (obtained from the Stack Monitoring dashboard for ES node):
- CPU: 15-25%
- Memory usage (JVM Heap): Fluctuates between 6GB and 45GB (10-70%)
- I/O operations rate: typically around 120/s, can occasionally spike to 200/s
- System load: 10-18
- Disk available: 1.1TB/7.3TB (I set an ILM policy so that available disk doesn't fall below 15%)
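If it's useful, those utilization numbers can also be cross-checked directly against the node (the host/port below is the default, not necessarily what my setup uses):

```sh
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m,load_5m,disk.used_percent,disk.avail'
```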
I currently have 196 indices/primary shards and 0 replica shards, containing 8.4B documents. Each index rolls over at ~50GB.
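The rollover condition in the ILM policy is just a size threshold, essentially like this (the policy name is a placeholder and the delete phase is omitted):

```sh
curl -s -XPUT 'localhost:9200/_ilm/policy/netflow-rollover' -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb" }
        }
      }
    }
  }
}'
```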
I think my Filebeat is dropping packets, because I always see the following in the Filebeat output (I'm running it as
I also don't know if this helps (from running
gc.collectors.young.collection_count: 132997
gc.collectors.young.collection_time_in_millis: 3530938
gc.collectors.old.collection_count: 0
gc.collectors.old.collection_time_in_millis: 0
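One thing I'm considering is enabling Filebeat's own HTTP stats endpoint so I can watch its pipeline counters as well; if I read the docs correctly that would be something like this (not yet enabled on my setup, and the port is just the default):

```yaml
# filebeat.yml (not yet enabled on my setup)
http.enabled: true
http.host: localhost
http.port: 5066
```

and then reading the counters with `curl -s 'localhost:5066/stats?pretty'`.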
What can I do to determine where the bottleneck is?