Collection and storage of information from netflow sources

Hello everyone. Maybe it's a newbie question, but I have limited experience with ELK. Currently, I'm trying to use it as a netflow collector. I've configured everything according to the documentation, but I'm a bit concerned about the amount of data that the indexes end up occupying. I tried to estimate how much data one source sends, and based on the statistics, it seems to be around 800 MB per hour, while the indexes occupy approximately 8-9 GB for the same period. Is this normal overhead, or did I configure something incorrectly?

Hi @BugS,

Welcome to the community! Can you share a bit more of how you are ingesting Netflow data into Elasticsearch and what guide you are following?

Hi @carly.richmond ,
Thank you for your answer.
I have installed and configured Filebeat according to this document: Configure Filebeat | Filebeat Reference [7.17] | Elastic.
I also enabled and configured the netflow module within Filebeat as described in: NetFlow module | Filebeat Reference [7.17] | Elastic.
The output utilizes an Elasticsearch cluster consisting of three nodes.

Thanks for the additional details @BugS. I don't see anything obvious in the documentation regarding volumes. Filebeat does have processors that can be used to transform data, including drop_event and drop_fields if you think there are particular events or fields that you don't need.
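As a sketch, the drop_fields processor mentioned above would go in filebeat.yml roughly like this (the field names here are purely illustrative, not a recommendation of what to drop):

```yaml
# filebeat.yml (sketch): drop fields you do not need before they are indexed.
# The fields listed below are examples only.
processors:
  - drop_fields:
      fields:
        - "netflow.flow_start_sys_up_time"
        - "netflow.flow_end_sys_up_time"
      ignore_missing: true
```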

No, the NetFlow stream received by filebeat doesn't contain any unnecessary fields or anything that could be discarded.

Hi @BugS

10x seems very high.

A couple questions

How did you come up with the 800 MB/hour figure ... from the source?

Are you sure you are not ingesting other data?

If you go to Discover with the index pattern filebeat-*, set the time picker to 1 hour of the data, and filter on event.dataset : "netflow.log", how many events do you see?

Then go to Kibana Dev Tools and run

GET _cat/indices/filebeat-*?v

And share the results, especially for the filebeat indices.

And let's take a look at your numbers.

My netflow generator creates about 800 bytes per event/document in the primary shard (not saying that is exactly what you will get); double that with a replica. Your netflow may be a bit larger.

Let's see what yours looks like.


@stephenb
Hi Stephen
The device that sends NetFlow is a Cisco router, which simply allows you to see statistics on the volume of data sent by the NetFlow exporter. Currently, I am experimenting with only one device, so I can confidently state that no other data is being received by the NetFlow collector. The volume of data within one hour varies depending on the device's load. I provided the peak volume in my first post, but in any case, the 10x ratio holds: 100 MB of NetFlow data sent by the device results in approximately 1000 MB of Elasticsearch indices on disk. If events are filtered with the index pattern "filebeat-*", I currently see 1,996,779 hits (screenshot attached).

The result of executing GET _cat/indices/filebeat-*?v is as follows:

health status index                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   filebeat-7.17.9-2023.06.27-000375 nH4e9XgyQpyYZp0zDJB6Vg   1   1    3054795            0      2.1gb            1gb
green  open   filebeat-7.17.9-2023.06.27-000376 3F5uYGlcRWiYsqMT0tx8fg   1   1    2448618            0      1.7gb        893.3mb
green  open   filebeat-7.17.9-2023.06.27-000377 nLzJ1Co-RAq0JS77GdNiLw   1   1    1363118            0        1gb          533mb

However, currently, the load on the device is quite low. During the day, the volumes will be much higher.

Hi @BugS

So the quick math says that your netflow documents on disk are ~400 bytes per document, which is quite reasonable.

893.3MB / 2448618 Docs = 382 bytes / doc

And at about 2M Docs / Hour that is about 800MB/hour primary storage.

2M / hour equals ~555 events/sec ≈ ~200 KB/sec.
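The quick math above can be reproduced as a short sketch (figures copied from the _cat/indices output earlier in the thread; MB/GB are treated as MiB/GiB here):

```python
# Reproduce the per-document sizing math from this thread.
pri_store_bytes = 893.3 * 1024**2   # pri.store.size of one hourly index
doc_count = 2_448_618               # docs.count of the same index

per_doc = pri_store_bytes / doc_count
print(f"{per_doc:.0f} bytes/doc")   # ~383 bytes/doc

eps = 2_000_000 / 3600              # ~2M docs per hour
print(f"{eps:.0f} events/sec")      # ~556 events/sec
print(f"{eps * per_doc / 1024:.0f} KiB/sec primary")  # ~208 KiB/sec
```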

How is that 100 MB measured? That does not seem right ... I'm pretty confident it is not a 10x factor on data size. You should be able to see the documents in Discover.

The ratio of documents sent to primary storage on disk should be about 1.1-2x at most...

BTW, did you adjust the ILM policy to roll over at 1 GB? Hopefully for testing only; you would not want to do that for production / large volumes.
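For reference, a rollover policy closer to the stock Filebeat defaults (roughly 50 GB / 30 days) can be set in Dev Tools; this is only a sketch of the relevant part of the policy:

```
PUT _ilm/policy/filebeat
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}
```

Larger indices generally compress and merge more efficiently than many tiny 1 GB ones.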

Hi @stephenb
Just for example
Statistics from netflow source for one hour:

Flow Exporter elk:
  Packet send statistics (last cleared 01:00:25 ago):
    Successfully sent:         304796                (319794384 bytes)

319794384 bytes = ~ 320 Mbytes

Indexes on the elastic node for the same hour:

health status index                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   filebeat-7.17.9-2023.06.28-000002 9XEAyYuVTGK8q32_Wwcn6g   1   1    1863823            0      1.3gb        686.9mb
green  open   filebeat-7.17.9-2023.06.28-000001 PyBbhFgXRDKngABQ0zVYIg   1   1    2984925            0      2.1gb            1gb

686 Mbytes + 1 Gbyte = 1686 Mbytes

1686 / 320 = ~ 5

Yes, I agree that I calculated it somewhat incorrectly; I should have calculated based on pri.store.size. But it still comes out to ~5x, even with "codec": "best_compression" in the index template. I think that's still too much.
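The exporter-bytes vs. index-size ratio above can be checked with a quick calculation (figures from this thread; the exact ratio depends on unit conventions, e.g. taking 1 GB as 1024 MB gives ~5.6x rather than ~5x):

```python
# Ratio of primary index storage to raw NetFlow bytes sent by the exporter.
exporter_bytes = 319_794_384              # "Successfully sent" bytes in one hour
primary_bytes = (686.9 + 1024) * 1024**2  # pri.store.size of both hourly indices

ratio = primary_bytes / exporter_bytes
print(f"{ratio:.1f}x")  # ~5.6x
```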

Hmm still seems odd.

In the end your documents are ~400 bytes each on disk, which is completely normal.

You can drop some fields that you do not wish to keep which can reduce the size.

In general, network / netflow / firewall logs etc. tend to be high volume and can require significant disk space.

Still curious if you changed the ILM policy; the smaller the index, the less efficient the storage.

Also, you can run a force merge on the indices, which will compact them.

These two items might provide a small amount of improvement.
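For example, a force merge on one of the indices shown earlier would look like this in Dev Tools (only run this on indices that are no longer being written to):

```
POST filebeat-7.17.9-2023.06.27-000375/_forcemerge?max_num_segments=1
```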

Curious: what are you basing your thought on that ~400 bytes per doc is too much for a fully parsed, searchable, and actionable data set?

Hello @stephenb
As I mentioned earlier, it is difficult for me to estimate the exact volume of information that will be occupied after it is transformed into Elasticsearch indices. While I understand that the volume increases because the document not only contains the acquired data but also makes the information searchable, it's challenging for me to evaluate how much it should increase the final volume relative to its original size. This is because it is my first experience with the ELK stack.
I haven't yet been able to understand at which stage the "netflow" fields are mapped to document fields in the Elastic index, and where exactly I can influence it (if such a possibility exists). Perhaps this is the root of my problem.

Hi @BugS

This is a fairly common predicament; you are doing just fine, and these are pretty common questions.

But let's recap:

  1. Your events are ~400 bytes per event/document on disk in primary storage; double that with a replica. This is very normal for this type of data.
  2. If you can estimate your Events / Sec or Events / Day etc then you can estimate your storage requirements.
  3. Trying to compare it to whatever you are measuring is probably not very fruitful.
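Point 2 can be sketched with some hypothetical numbers (the per-document size comes from this thread; the daily event count is only an illustrative assumption):

```python
# Hypothetical capacity estimate from measured per-document size.
bytes_per_doc = 400          # measured primary bytes per event (from this thread)
events_per_day = 48_000_000  # assumption: ~2M events/hour sustained
replicas = 1

daily_bytes = bytes_per_doc * events_per_day * (1 + replicas)
print(f"{daily_bytes / 1024**3:.1f} GiB/day incl. one replica")  # ~35.8
```

Multiply by your retention period to get total disk, and add headroom for merges and growth.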

That ~400B per document includes the source JSON plus all the indexed, searchable, aggregatable fields, which are contained in specialized/optimized data structures. The _source and the fields are different; both take up storage and both have value. I can not explain ALL of Elastic to you, but there is a reason for both of these. See the bottom for my example.

The _source is the human-readable JSON

The fields are what all the aggregations, detections, visualization etc use to support those capabilities.
Both are important.

The _source JSON is stored compressed, and the fields are kept in optimized data structures.

You can see the fields and the `_source` by running this command; see the example at the bottom.

GET filebeat-*/_search
{
  "fields": ["*"]
}

There are ways to reduce storage with some techniques, but there is always a tradeoff.

The basic flow is:

  1. Filebeat reads the UDP packet, converts it to JSON (that definitely increases the size), and adds some fields about the host agent (though not a lot for netflow).

So the incoming UDP netflow event is converted by Filebeat to something like this. Yup, much bigger (this gets compressed when it eventually gets stored on disk):

{"@timestamp":"2023-07-02T21:22:58.875Z","@metadata":{"beat":"filebeat","type":"_doc","version":"8.8.0","pipeline":"filebeat-8.8.0-netflow-log-pipeline"},"destination":{"locality":"internal","port":161,"ip":"172.30.190.10"},"network":{"transport":"udp","iana_number":17,"bytes":653,"packets":395,"direction":"unknown","community_id":"1:FLZTA4GQnkjs4vBnBIutg6/P3TA="},"tags":["forwarded"],"ecs":{"version":"1.12.0"},"flow":{"locality":"external","id":"Y-DBHw3QZxg"},"service":{"type":"netflow"},"input":{"type":"netflow"},"agent":{"type":"filebeat","version":"8.8.0","ephemeral_id":"ab48691b-c77b-4290-99a6-bf1fd8032f6a","id":"90b1bdf1-f0d0-4a5e-b3ce-4603c999bd5a","name":"hyperion"},"source":{"locality":"external","port":40,"bytes":653,"packets":395,"ip":"112.10.20.10"},"observer":{"ip":"192.168.2.108"},"related":{"ip":["112.10.20.10","172.30.190.10"]},"netflow":{"source_transport_port":40,"source_ipv4_prefix_length":6,"ip_class_of_service":0,"bgp_source_as_number":32456,"flow_start_sys_up_time":3103,"protocol_identifier":17,"type":"netflow_flow","packet_delta_count":395,"flow_end_sys_up_time":3127,"ip_next_hop_ipv4_address":"172.199.15.1","egress_interface":0,"bgp_destination_as_number":57043,"source_ipv4_address":"112.10.20.10","destination_ipv4_prefix_length":0,"destination_ipv4_address":"172.30.190.10","tcp_control_bits":0,"ingress_interface":0,"octet_delta_count":653,"destination_transport_port":161,"exporter":{"uptime_millis":3307,"address":"192.168.2.108:49455","engine_type":1,"engine_id":0,"sampling_interval":0,"version":5,"timestamp":"2023-07-02T21:22:58.875Z"}},"event":{"category":["network"],"action":"netflow_flow","end":"2023-07-02T21:22:58.695Z","created":"2023-07-02T21:22:58.876Z","module":"netflow","type":["connection"],"start":"2023-07-02T21:22:58.671Z","duration":24000000,"kind":"event","dataset":"netflow.log"},"fileset":{"name":"log"}}
  2. Then Filebeat sends that JSON payload to Elasticsearch, where it runs through an ingest pipeline which parses the message/data and enriches it with geolocation data, etc.

  3. The document gets indexed (i.e. written to disk): the _source basically gets zipped/compressed, and the fields are mapped to the correct types and stored in optimized data structures...

GET filebeat-*/_search
{
  "fields": ["*"]
}

All this data below takes up ~400 bytes on disk... so it is pretty incredibly efficient...

      {
        "_index": ".ds-filebeat-8.8.0-2023.06.15-000001",
        "_id": "R8i864gBMo9gMp_Zlq38",
        "_score": 1,
        "_source": {
          "agent": {
            "name": "hyperion",
            "id": "90b1bdf1-f0d0-4a5e-b3ce-4603c999bd5a",
            "ephemeral_id": "0b0bbf68-e2b4-4461-ac63-581d2e9a14cf",
            "type": "filebeat",
            "version": "8.8.0"
          },
          "destination": {
            "port": 80,
            "ip": "172.30.190.10",
            "locality": "internal"
          },
          "source": {
            "geo": {
              "continent_name": "Asia",
              "country_iso_code": "CN",
              "country_name": "China",
              "location": {
                "lon": 113.722,
                "lat": 34.7732
              }
            },
            "as": {
              "number": 56041,
              "organization": {
                "name": "China Mobile communications corporation"
              }
            },
            "port": 40,
            "bytes": 147,
            "ip": "112.10.20.10",
            "locality": "external",
            "packets": 99
          },
          "fileset": {
            "name": "log"
          },
          "network": {
            "community_id": "1:2XUebCsnI3VlL8fg4KKToSenV5Q=",
            "bytes": 147,
            "transport": "tcp",
            "packets": 99,
            "iana_number": 6,
            "direction": "inbound"
          },
          "tags": [
            "forwarded"
          ],
          "input": {
            "type": "netflow"
          },
          "observer": {
            "ip": "192.168.2.108"
          },
          "netflow": {
            "destination_ipv4_prefix_length": 24,
            "source_ipv4_prefix_length": 8,
            "packet_delta_count": 99,
            "protocol_identifier": 6,
            "bgp_destination_as_number": 63823,
            "flow_start_sys_up_time": 215242,
            "octet_delta_count": 147,
            "egress_interface": 0,
            "bgp_source_as_number": 35233,
            "type": "netflow_flow",
            "ip_next_hop_ipv4_address": "172.199.15.1",
            "destination_ipv4_address": "172.30.190.10",
            "source_ipv4_address": "112.10.20.10",
            "exporter": {
              "uptime_millis": 215450,
              "engine_type": 1,
              "address": "192.168.2.108:61755",
              "engine_id": 0,
              "version": 5,
              "sampling_interval": 0,
              "timestamp": "2023-06-24T04:49:44.203Z"
            },
            "tcp_control_bits": 0,
            "ip_class_of_service": 0,
            "ingress_interface": 0,
            "flow_end_sys_up_time": 215280,
            "source_transport_port": 40,
            "destination_transport_port": 80
          },
          "@timestamp": "2023-06-24T04:49:44.203Z",
          "related": {
            "ip": [
              "112.10.20.10",
              "172.30.190.10"
            ]
          },
          "ecs": {
            "version": "1.12.0"
          },
          "service": {
            "type": "netflow"
          },
          "event": {
            "duration": 38000000,
            "ingested": "2023-06-24T04:49:45.207061293Z",
            "created": "2023-06-24T04:49:44.203Z",
            "kind": "event",
            "module": "netflow",
            "start": "2023-06-24T04:49:43.995Z",
            "action": "netflow_flow",
            "end": "2023-06-24T04:49:44.033Z",
            "category": [
              "network"
            ],
            "type": [
              "connection"
            ],
            "dataset": "netflow.log"
          },
          "flow": {
            "locality": "external",
            "id": "ZFU2p8Lb-eU"
          }
        },
        "fields": {
          "flow.id": [
            "ZFU2p8Lb-eU"
          ],
          "event.category": [
            "network"
          ],
          "netflow.exporter.sampling_interval": [
            0
          ],
          "traefik.access.geoip.location": [
            {
              "coordinates": [
                113.722,
                34.7732
              ],
              "type": "Point"
            }
          ],
          "netflow.ip_class_of_service": [
            0
          ],
          "service.type": [
            "netflow"
          ],
          "netflow.tcp_control_bits": [
            0
          ],
          "netflow.source_transport_port": [
            40
          ],
          "netflow.exporter.version": [
            5
          ],
          "netflow.exporter.address": [
            "192.168.2.108:61755"
          ],
          "netflow.bgp_source_as_number": [
            35233
          ],
          "netflow.destination_ipv4_prefix_length": [
            24
          ],
          "source.ip": [
            "112.10.20.10"
          ],
          "agent.name": [
            "hyperion"
          ],
          "network.community_id": [
            "1:2XUebCsnI3VlL8fg4KKToSenV5Q="
          ],
          "event.kind": [
            "event"
          ],
          "source.packets": [
            99
          ],
          "network.packets": [
            99
          ],
          "netflow.flow_start_sys_up_time": [
            215242
          ],
          "netflow.destination_ipv4_address": [
            "172.30.190.10"
          ],
          "fileset.name": [
            "log"
          ],
          "flow.locality": [
            "external"
          ],
          "traefik.access.geoip.country_iso_code": [
            "CN"
          ],
          "netflow.source_ipv4_prefix_length": [
            8
          ],
          "input.type": [
            "netflow"
          ],
          "agent.hostname": [
            "hyperion"
          ],
          "tags": [
            "forwarded"
          ],
          "source.port": [
            40
          ],
          "agent.id": [
            "90b1bdf1-f0d0-4a5e-b3ce-4603c999bd5a"
          ],
          "ecs.version": [
            "1.12.0"
          ],
          "event.created": [
            "2023-06-24T04:49:44.203Z"
          ],
          "network.iana_number": [
            "6"
          ],
          "agent.version": [
            "8.8.0"
          ],
          "event.start": [
            "2023-06-24T04:49:43.995Z"
          ],
          "source.as.number": [
            56041
          ],
          "observer.ip": [
            "192.168.2.108"
          ],
          "netflow.source_ipv4_address": [
            "112.10.20.10"
          ],
          "netflow.type": [
            "netflow_flow"
          ],
          "netflow.exporter.engine_id": [
            0
          ],
          "destination.port": [
            80
          ],
          "netflow.bgp_destination_as_number": [
            63823
          ],
          "netflow.flow_end_sys_up_time": [
            215280
          ],
          "event.end": [
            "2023-06-24T04:49:44.033Z"
          ],
          "netflow.octet_delta_count": [
            147
          ],
          "source.geo.location": [
            {
              "coordinates": [
                113.722,
                34.7732
              ],
              "type": "Point"
            }
          ],
          "agent.type": [
            "filebeat"
          ],
          "event.module": [
            "netflow"
          ],
          "related.ip": [
            "112.10.20.10",
            "172.30.190.10"
          ],
          "source.geo.country_iso_code": [
            "CN"
          ],
          "netflow.ingress_interface": [
            0
          ],
          "network.bytes": [
            147
          ],
          "netflow.packet_delta_count": [
            99
          ],
          "netflow.exporter.engine_type": [
            1
          ],
          "network.direction": [
            "inbound"
          ],
          "netflow.ip_next_hop_ipv4_address": [
            "172.199.15.1"
          ],
          "netflow.exporter.uptime_millis": [
            215450
          ],
          "source.bytes": [
            147
          ],
          "destination.locality": [
            "internal"
          ],
          "source.as.organization.name.text": [
            "China Mobile communications corporation"
          ],
          "netflow.destination_transport_port": [
            80
          ],
          "netflow.exporter.timestamp": [
            "2023-06-24T04:49:44.203Z"
          ],
          "source.as.organization.name": [
            "China Mobile communications corporation"
          ],
          "source.geo.continent_name": [
            "Asia"
          ],
          "traefik.access.geoip.continent_name": [
            "Asia"
          ],
          "source.locality": [
            "external"
          ],
          "destination.ip": [
            "172.30.190.10"
          ],
          "network.transport": [
            "tcp"
          ],
          "event.duration": [
            38000000
          ],
          "netflow.protocol_identifier": [
            6
          ],
          "event.ingested": [
            "2023-06-24T04:49:45.207Z"
          ],
          "event.action": [
            "netflow_flow"
          ],
          "@timestamp": [
            "2023-06-24T04:49:44.203Z"
          ],
          "event.type": [
            "connection"
          ],
          "agent.ephemeral_id": [
            "0b0bbf68-e2b4-4461-ac63-581d2e9a14cf"
          ],
          "source.geo.country_name": [
            "China"
          ],
          "event.dataset": [
            "netflow.log"
          ],
          "netflow.egress_interface": [
            0
          ]
        }
      },

Hope this helps a bit... in the end you can drop some fields if you want, or you can even drop the _source, but that is not recommended unless you really know what you are doing...
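(For completeness, dropping the _source is done in the index template mappings, roughly as below for a 7.17 legacy template. Again: not recommended unless you understand the tradeoffs; you lose reindex, update-by-query, and the original JSON in search results.)

```
PUT _template/filebeat-no-source
{
  "index_patterns": ["filebeat-*"],
  "mappings": {
    "_source": { "enabled": false }
  }
}
```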

In Summary:

  • Your events are ~400 bytes on disk, which is completely normal.
  • The easiest way to estimate your storage is to run a POC and interpolate from there based on your volume of events.
  • This is netflow data; it tends to be high volume, and high volume = disk space.
  • If you were building a production Elastic system, you would learn about data tiers and ILM (move old data to more cost-efficient storage, potentially even to blob storage, etc.).
  • There is a lot to consider.

@stephenb
Thank you for the detailed answer, Stephen. Now I can confidently justify the need for additional disk space :slightly_smiling_face: As you correctly noted, I changed the ILM policy, but only temporarily for testing purposes. However, to be honest, I don't fully understand how to choose optimal values for it. Perhaps there are some useful articles on this topic?
