How is the system.network.in.dropped calculated?

Hi, I have noticed that one machine has a higher rate of dropped packets than the others. It shows about 1% packet loss, while the other machines are well below 1%.

For example:
Machine 1: 14 dropped packets out of 200 million
Machine 2: 2 million dropped packets out of 200 million

In the result below you can see "dropped": 2750373. Is this number cumulative over the uptime of the machine? Or is it how many packets were dropped at that particular timestamp?

I run this query:

GET metricbeat-*/_search
{
  "size": 100,
  "_source": ["@timestamp", "system.network.in.dropped", "host.name"],
  "query": {
    "query_string": {
      "query": "metricset.name:network AND host.name:XXXXXX-0001"
    }
  }
}

And I get...

{
  "took": 99,
  "timed_out": false,
  "_shards": {
    "total": 7,
    "successful": 7,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 118220,
    "max_score": 5.0651484,
    "hits": [
      {
        "_index": "metricbeat-6.4.2-2019.12.16",
        "_type": "doc",
        "_id": "Km65DW8Bbyfak3QNTgOk",
        "_score": 5.0651484,
        "_source": {
          "@timestamp": "2019-12-16T08:00:44.724Z",
          "system": {
            "network": {
              "in": {
                "dropped": 0
              }
            }
          },
          "host": {
            "name": "XXXXXX-0001"
          }
        }
      },
      {
        "_index": "metricbeat-6.4.2-2019.12.16",
        "_type": "doc",
        "_id": "K265DW8Bbyfak3QNTgOk",
        "_score": 5.0651484,
        "_source": {
          "@timestamp": "2019-12-16T08:00:44.724Z",
          "system": {
            "network": {
              "in": {
                "dropped": 2750373
              }
            }
          },
          "host": {
            "name": "XXXXXX-0001"
          }
        }
      }
    ]
  }
}

Metricbeat probably gets that dropped packet count from the interface statistics (the same counters netstat shows on Linux). On Linux, interface stats are typically cumulative since boot, but they can be reset in other ways. I think the same applies to Windows. You could look at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-serialdiff-aggregation.html to graph the drops over time.
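To make the "cumulative since boot, but can be reset" point concrete, here is a minimal Python sketch of turning cumulative samples into per-interval drops. The function name and the sample values are made up for illustration, and treating any negative difference as a reset is an assumption, not something Metricbeat does for you.

```python
from typing import List, Optional


def counter_deltas(samples: List[int]) -> List[Optional[int]]:
    """Turn cumulative counter samples into per-interval increments.

    Assumption: a negative difference means the counter was reset
    (reboot or manual reset), so it is reported as None instead of
    a negative drop count.
    """
    deltas: List[Optional[int]] = []
    for previous, current in zip(samples, samples[1:]):
        diff = current - previous
        deltas.append(diff if diff >= 0 else None)
    return deltas


# Hypothetical samples of one interface's cumulative dropped counter;
# the drop from 2750376 to 5 stands in for a counter reset.
samples = [2750373, 2750374, 2750376, 5, 9]
print(counter_deltas(samples))
```

The differences between consecutive samples are what you would graph, not the raw counter values.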

How a system knows that packets are lost is something I haven't reviewed in several years. We talked about it in a Wireshark class, but Google can help with that, for example: https://likegeeks.com/fix-packet-loss/

Debugging packet loss is probably a topic for another forum.

@javadevmtl It should be a monotonically increasing number. As packets go across the wire, the system increments integer counters for errors, packets, dropped packets, and bytes. Metricbeat samples these counters and records them to Elasticsearch. To view this as a rate, you will need to apply a derivative pipeline aggregation inside a date histogram aggregation. If you need the total number of packets for a specific time period, you will need to subtract the min from the max using a bucket script.
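As a rough sketch of the derivative-inside-date-histogram approach, the Python below only builds the request body and prints it as JSON, so you can paste it into Console. The function name, the aggregation names (over_time, max_dropped, dropped_per_bucket) and the 1m interval are assumptions picked for illustration; the query string is the one from the original question.

```python
import json


def dropped_rate_body(query_string: str,
                      field: str = "system.network.in.dropped",
                      interval: str = "1m") -> dict:
    """Build a search body that reports the counter increase per time bucket."""
    return {
        "size": 0,
        "query": {"query_string": {"query": query_string}},
        "aggs": {
            "over_time": {
                # "interval" matches the 6.x indices in this thread; newer
                # Elasticsearch versions use "fixed_interval" instead.
                "date_histogram": {"field": "@timestamp", "interval": interval},
                "aggs": {
                    # Highest counter value seen in each bucket.
                    "max_dropped": {"max": {"field": field}},
                    # Difference between this bucket's max and the previous one.
                    "dropped_per_bucket": {
                        "derivative": {"buckets_path": "max_dropped"}
                    },
                },
            }
        },
    }


body = dropped_rate_body("metricset.name:network AND host.name:XXXXXX-0001")
print(json.dumps(body, indent=2))
```

The first bucket has no derivative value, and a counter reset shows up as a negative value. If the host reports several interfaces, the max follows whichever interface has the largest counter, so you may also want to filter on the interface name.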

Here is an example of sampling the entire time range:

POST metricbeat-*/_search
{
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": {
        "field": "host.name",
        "size": 10
      },
      "aggs": {
        "max": {
          "max": {
            "field": "system.network.out.dropped"
          }
        },
        "min": {
          "min": {
            "field": "system.network.out.dropped"
          }
        },
        "total": {
          "bucket_script": {
            "buckets_path": {
              "min": "min",
              "max": "max"
            },
            "script": "params.max - params.min"
          }
        }
      }
    }
  }
}

You will want to change the query to limit this to a specific time range and host. The unfortunate part of bucket_script is that you have to run it inside a multi-bucket aggregation such as date_histogram or terms.
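One way to add that limit, sketched in Python so the body can be printed and pasted into Console. The function name, the default field (in.dropped rather than out.dropped) and the now-1h to now window are assumptions for illustration; the aggregation part mirrors the example above.

```python
import json


def total_dropped_body(host: str, start: str, end: str,
                       field: str = "system.network.in.dropped") -> dict:
    """Build a max-minus-min body restricted to one host and one time range."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"host.name": host}},
                    {"range": {"@timestamp": {"gte": start, "lte": end}}},
                ]
            }
        },
        "aggs": {
            # bucket_script needs a multi-bucket parent, so the terms
            # aggregation stays even though only one host matches.
            "hosts": {
                "terms": {"field": "host.name", "size": 10},
                "aggs": {
                    "max": {"max": {"field": field}},
                    "min": {"min": {"field": field}},
                    "total": {
                        "bucket_script": {
                            "buckets_path": {"min": "min", "max": "max"},
                            "script": "params.max - params.min",
                        }
                    },
                },
            }
        },
    }


# Hypothetical window: the last hour, using Elasticsearch date math.
print(json.dumps(total_dropped_body("XXXXXX-0001", "now-1h", "now"), indent=2))
```

Keep in mind that max minus min undercounts if the counter was reset inside the window.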

Hi, thanks. I looked at the sample Kibana dashboard that Metricbeat installs and came up with something. I get just about 1 packet lost per second on the input, which I can confirm by running netstat -i every second or so.
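For anyone who wants to script the same kind of check instead of watching netstat -i, here is a rough Python sketch that works on two snapshots of /proc/net/dev. It assumes a Linux host and the usual column layout of that file (receive drop is the fourth receive column). The function names and the sample snapshots are made up for illustration.

```python
from typing import Dict


def parse_rx_dropped(proc_net_dev_text: str) -> Dict[str, int]:
    """Return {interface: receive-drop counter} from /proc/net/dev contents."""
    counters: Dict[str, int] = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        # Receive columns: bytes packets errs drop fifo frame compressed multicast
        counters[name.strip()] = int(fields[3])
    return counters


def drop_rate(before: Dict[str, int], after: Dict[str, int],
              seconds: float) -> Dict[str, float]:
    """Dropped packets per second for interfaces present in both snapshots."""
    return {name: (after[name] - before[name]) / seconds
            for name in after if name in before}


HEADER = (
    "Inter-|   Receive                                                |  Transmit\n"
    " face |bytes    packets errs drop fifo frame compressed multicast"
    "|bytes    packets errs drop fifo colls carrier compressed\n"
)
# Hypothetical snapshots taken 10 seconds apart.
T0 = HEADER + "    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0\n" \
              "  eth0: 5000 200 0 2750373 0 0 0 0 4000 150 0 0 0 0 0 0\n"
T1 = HEADER + "    lo: 1200 12 0 0 0 0 0 0 1200 12 0 0 0 0 0 0\n" \
              "  eth0: 9000 400 0 2750383 0 0 0 0 7000 300 0 0 0 0 0 0\n"

print(drop_rate(parse_rx_dropped(T0), parse_rx_dropped(T1), seconds=10))
```

On a real machine you would read the file twice with open("/proc/net/dev").read() and sleep in between.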

From your query, here is what I get, which basically just shows the behaviour I have noticed:

    {
      "key": "XXXXXX-0001",
      "doc_count": 1374436,
      "min": {
        "value": 0
      },
      "max": {
        "value": 16
      },
      "total": {
        "value": 16
      }
    },
    {
      "key": "XXXXXX-0003",
      "doc_count": 1340805,
      "min": {
        "value": 0
      },
      "max": {
        "value": 57
      },
      "total": {
        "value": 57
      }
    },
    {
      "key": "YYYYYY-0002",
      "doc_count": 1168439,
      "min": {
        "value": 0
      },
      "max": {
        "value": 1208688
      },
      "total": {
        "value": 1208688
      }
    },
    {
      "key": "YYYYYY-0003",
      "doc_count": 1162877,
      "min": {
        "value": 0
      },
      "max": {
        "value": 1398010
      },
      "total": {
        "value": 1398010
      }
    }
