Filebeat Netflow Module || Kibana Dashboard Extremely Slow

I've deployed the Netflow module, and when I open the Filebeat Netflow dashboards and query the last 24 hours, today, or the last 4 hours, the query is extremely slow.

Environment

  • ELK 7.12
  • 3 data nodes (12 vCPU, 10 GB RAM, -Xms5g / -Xmx5g, no swap, ulimit open files) - each server is on a different disk (SAS 1TB)
  • 1 Kibana
  • 1 Logstash / Filebeat

I do not see problems related to my hardware resources.

Index Filebeat

    1. Index size: 145GB
    2. 1 shard and 1 replica
    3. Docs count: 119919740 (119 million)
    4. I already tried to use 2 replicas, same result.

Other dashboards work perfectly, for example Packetbeat DNS Overview, Metricbeat Overview, vSphere, etc. My problem is only with the Filebeat Netflow module.

I was using ELK 7.11.1 and updated to 7.12 to try to solve the problem, but without success.

What does Monitoring show when you are running this dashboard?

I think the issue is the 145 GB index size with a single primary shard. One primary shard should not be bigger than 50 GB, so please try splitting the index into 3 primary shards with 1 replica.
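The arithmetic behind that suggestion can be sketched as follows. This is only an illustration, assuming the 50 GB per-shard ceiling quoted above as a rule of thumb. The function name `primaries_needed` and the `settings_body` dictionary are hypothetical helpers, not part of any Elastic client.

```python
import json
import math


def primaries_needed(index_size_gb: float, max_shard_gb: float = 50.0) -> int:
    """Smallest primary shard count that keeps each shard at or under max_shard_gb."""
    if index_size_gb <= 0 or max_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / max_shard_gb))


# Figure from the original post: a single 145 GB index.
shards = primaries_needed(145)

# Index settings one could carry into an index template. Where that template
# lives depends on how Filebeat was set up (assumption, not from the thread).
settings_body = {
    "settings": {
        "index": {"number_of_shards": shards, "number_of_replicas": 1}
    }
}
print(json.dumps(settings_body, indent=2))
```

With the numbers from the post this gives 3 primaries, which also happens to match one shard per data node on a 3-node cluster. The primary shard count is fixed when an index is created, so the new value would only apply to newly created indices.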

@Felix_Roessel and @warkolm

I tried querying with different index sizes (15GB, 30GB, 50GB, etc.). As mentioned above, I tried two replicas (same problem), but not more shards.

While I am waiting for the result of my query, I cannot do anything else in Kibana; Monitoring just freezes. Looking at Stack Monitoring, I noticed higher CPU utilization on node 1, where the primary is stored. However, even after giving it more and more CPU, the utilization stays the same: 100% CPU.

I cannot understand why this dashboard consumes so much CPU. I think what I gave each server is enough to start with. Right now I have only this index, with less than 24 hours of logs/data/netflow traffic, on 3 data nodes (12 vCPU, 10 GB RAM, -Xms5g / -Xmx5g, no swap, ulimit open files), each server on a different disk (SAS 1TB). I also increased the number of threads.

I think the question is: does anyone use the Filebeat Netflow module and find that it really works?

GET _cat/shards/filebeat-000001

filebeat-000001 0 p STARTED 38309542 23.2gb 172.20.11.43 elk-node-1
filebeat-000001 0 r STARTED 38308346 23.1gb 172.20.11.44 elk-node-2

GET _cat/indices/filebeat-000001

green open filebeat-000001 g0SBMcRgSoSBFgvOSr8wIg 1 1 38326478 0 46.4gb 23.2gb

GET filebeat-000001/_ilm/explain

{
  "indices" : {
    "filebeat-000001" : {
      "index" : "filebeat-000001",
      "managed" : true,
      "policy" : "filebeat_policy",
      "lifecycle_date_millis" : 1618274869201,
      "age" : "16.93h",
      "phase" : "hot",
      "phase_time_millis" : 1618274869502,
      "action" : "rollover",
      "action_time_millis" : 1618274869884,
      "step" : "check-rollover-ready",
      "step_time_millis" : 1618274869884,
      "phase_execution" : {
        "policy" : "filebeat_policy",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "30gb",
              "max_age" : "1d"
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1618267042453
      }
    }
  }
}
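The output above can be read against the two rollover conditions in the policy. The sketch below is a simplified illustration, not how ILM is implemented: it assumes `max_size` is compared with the primary store size (23.2gb in the `_cat/indices` output) and it only parses the units that appear in this thread. All function names are hypothetical.

```python
def parse_gb(text: str) -> float:
    """Parse sizes such as '23.2gb' or '30gb' into gigabytes (only 'gb' is handled)."""
    text = text.strip().lower()
    if not text.endswith("gb"):
        raise ValueError(f"unsupported size unit: {text!r}")
    return float(text[:-2])


def parse_hours(text: str) -> float:
    """Parse ages such as '16.93h' or '1d' into hours (only 'h' and 'd' are handled)."""
    text = text.strip().lower()
    factors = {"h": 1.0, "d": 24.0}
    unit = text[-1:]
    if unit not in factors:
        raise ValueError(f"unsupported age unit: {text!r}")
    return float(text[:-1]) * factors[unit]


def rollover_due(primary_size: str, age: str, max_size: str, max_age: str) -> bool:
    """True when either rollover condition from the policy is met."""
    too_big = parse_gb(primary_size) >= parse_gb(max_size)
    too_old = parse_hours(age) >= parse_hours(max_age)
    return too_big or too_old


# Values copied from the _cat/indices and _ilm/explain output above.
print(rollover_due("23.2gb", "16.93h", "30gb", "1d"))
```

Under these assumptions neither condition is met yet, which is consistent with the index still sitting in the `check-rollover-ready` step.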

@albertosoares you may be interested in the ElasticON event being held tomorrow, where ElastiFlow customer Alex Germain from OHSU will share his experience building an Elastic Stack environment for network traffic analytics at scale. https://www.elastic.co/elasticon/public-sector/state-local-government-education#agenda. We helped him stand up an environment that can query 5 billion+ flow records, and the dashboards still render in 5-10 seconds. It achieves this even though the ElastiFlow records go deeper and are more complex than the simple Filebeat results.

A few comments on your current environment:

  1. vCPUs are not physical CPU cores. The general rule is vCPU/2... so basically you have the equivalent of 6 real cores per node. This isn't a lot, especially considering that Kibana queries are largely aggregations, and there is math to be done.

  2. The other issue is the single shard. On a 3-node cluster, following the recommendation by @Felix_Roessel would be the best way to spread the load among all nodes.

  3. You mention SAS drives. Are these spinning HDDs or SSDs? HDDs are going to be almost unusable at the volume of data you are trying to query. If using HDDs I suspect if you use top rather than htop you will find a high percentage of IO wait. To understand just how bad spinning disks are, you may want to check out this video. While the tests are ingest focused, the laws of physics for HDDs are the same.
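Points 1 and 2 can be put into numbers with a rough sketch. It assumes the vCPU/2 rule of thumb stated above and the fact that a search request is served by one copy of each shard. The `Estimate` type and `estimate` function are hypothetical names, and the result is a back-of-the-envelope figure, not a performance model.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    cores_per_node: float  # physical-core equivalent per node
    nodes_searching: int   # nodes that can do shard-level work for one request


def estimate(vcpus_per_node: int, data_nodes: int, primary_shards: int) -> Estimate:
    # Rule of thumb from the post above: two vCPUs roughly equal one physical core.
    cores = vcpus_per_node / 2
    # One request reads one copy of each shard, so at most this many nodes take part.
    busy = min(primary_shards, data_nodes)
    return Estimate(cores_per_node=cores, nodes_searching=busy)


print(estimate(vcpus_per_node=12, data_nodes=3, primary_shards=1))
print(estimate(vcpus_per_node=12, data_nodes=3, primary_shards=3))
```

With a single primary, each request does its shard-level work on one node, whichever copy is chosen. Extra replicas help with concurrent requests but do not split a single request, which may be why adding replicas alone changed little here.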

To give you an idea of how a more ideal environment performs, I ran a query in one of our lab environments. It only has 102 million flow records (so a little less than yours), but as I mentioned the records are more complex. This is a single server with a 16 core/32 thread processor and 128GB RAM (31GB JVM heap). There are dual data paths, each on an NVMe SSD. The index has two shards, so one per SSD. When launching our most complex dashboard with all 102M records, it returns in under 3.5 seconds. This is the difference that the right environment can make.
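The heap figures in both environments (5 GB of 10 GB, 31 GB of 128 GB) fit the commonly cited guidance of giving the JVM about half of the RAM while staying below roughly 32 GB. A minimal sketch of that rule, where the 50% share and the 31 GB cap are assumptions taken from that general guidance and `suggested_heap_gb` is a hypothetical helper:

```python
def suggested_heap_gb(ram_gb: float, cap_gb: float = 31.0) -> float:
    """Half of the machine's RAM, capped at cap_gb (assumed rule of thumb)."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb / 2, cap_gb)


# The two environments described in this thread.
print(suggested_heap_gb(10))   # 3-node cluster from the original post
print(suggested_heap_gb(128))  # lab server described above
```

The rest of the RAM is left to the operating system's filesystem cache, which search performance depends on heavily. A 10 GB node therefore has far less room for caching index data than the 128 GB lab server.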

@Felix_Roessel

I configured my index with 3 replicas, one on each node of the cluster. It helped; however, I still have to wait 1 or 2 minutes when I query 24 hours of data. Following the recommendations of @rcowart, I changed the server to a better one.

@rcowart

Thanks for sharing the ElasticON event. I watched the presentation by Alex Germain from OHSU; what you guys did was amazing. He is using a great server to host the ELK solution (all NVMe, good CPU and memory).

I installed ElastiFlow to compare it with the Filebeat Netflow module. I got better performance; of course, the dashboards are different (better). ElastiFlow is better than the Filebeat module in dashboards, performance, and structure, and the collector is awesome. The problem is the number of fields when using the free version. Also, you have to improve the ILM setup (custom template, index name (allowing 00001), and other things); otherwise the ILM policy will not work properly, and doing it manually is not a good idea.

As for my disks and CPUs, I am now running the ELK solution on a new server with a better CPU, more memory, and one NVMe disk. I moved my 3 nodes to the 2TB NVMe disk, and to divide the CPU workload there are 3 primary shards (one per Elasticsearch node) and 1 replica, as @Felix_Roessel recommended.

Waiting for more logs to query and analyze the results.

@Felix_Roessel @rcowart @warkolm

I did a lot of tests with the Filebeat NetFlow module and ElastiFlow. Unfortunately, the Filebeat NetFlow module is not good yet. Even with NVMe disks and more CPU and memory, the result of a 24h query is terrible: 2 to 4 minutes. A similar query with ElastiFlow took 1 or 2 seconds.

I am gonna use ElastiFlow free edition.

@rcowart please give me directions to deploy my ILM policy (everything automatically as mentioned before) - send me a direct message.