Data set difference between fields on different indexes


#1

Hi, per my previous post, I ended up creating two separate indexes (Index 1, Index 2) that share one common field, dstIP.

I'd like now, let's say on an hourly basis, to get a list of all the IP addresses that exist in Index 1 but not in Index 2.

What is the best way of accomplishing this? A scripted filter? A DSL filter query?

Can the outcome be visualized for further understanding of where these IPs come from?

Thanks.


(Abdon Pijpelink) #2

I would solve this with aggregations rather than queries. You start with a composite aggregation, to get a list of all unique IPs across the two indexes.

Nested inside that, you use a terms aggregation on the _index meta field to get a bucket for each index in which an IP exists. IPs that occur in both indexes get two buckets; IPs that occur in only one index get one bucket.

Finally, you can use a bucket_selector pipeline aggregation to filter out the IP addresses for which there are two buckets, in other words, to retrieve only the IPs that occur in a single index.

That request would look like this:

GET index1,index2/_search
{
  "size": 0,
  "aggs": {
    "all_ips": {
      "composite": {
        "sources": [
          {
            "dstIP": {
              "terms": {
                "field": "dstIP"
              }
            }
          }
        ]
      },
      "aggs": {
        "indexes_per_ip": {
          "terms": {
            "field": "_index",
            "size": 2
          }
        },
        "index_count_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "index_count": "indexes_per_ip._bucket_count"
            },
            "script": "params.index_count == 1"
          }
        }
      }
    }
  }
}
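Note that the bucket_selector keeps IPs that occur in a single index, whichever index that is. If you specifically want the IPs that are in Index 1 but not Index 2, you can post-filter the response on the client side. A minimal sketch, assuming the response shape produced by the query above (the function name and the sample data are illustrative, not part of any API):

```python
def ips_only_in(agg_response, index_name):
    """Collect the dstIP values whose single remaining index bucket
    belongs to index_name."""
    ips = []
    for bucket in agg_response["aggregations"]["all_ips"]["buckets"]:
        index_buckets = bucket["indexes_per_ip"]["buckets"]
        # The bucket_selector already removed IPs seen in both indexes,
        # so each surviving IP has exactly one index bucket.
        if index_buckets and index_buckets[0]["key"] == index_name:
            ips.append(bucket["key"]["dstIP"])
    return ips

# Shortened sample response, shaped like the query output above
sample = {
    "aggregations": {
        "all_ips": {
            "buckets": [
                {"key": {"dstIP": "10.0.0.1"},
                 "indexes_per_ip": {"buckets": [{"key": "index1", "doc_count": 3}]}},
                {"key": {"dstIP": "10.0.0.2"},
                 "indexes_per_ip": {"buckets": [{"key": "index2", "doc_count": 1}]}},
            ]
        }
    }
}

print(ips_only_in(sample, "index1"))  # ['10.0.0.1']
```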

#3

Thank you @abdon for your reply.

I ran the query, replacing index1 and index2 with the daily indexes for my two data logs.

The result I got was:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 240139,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "all_ips" : {
      "after_key" : {
        "dstIP" : "5.189.xx.xx"
      },
      "buckets" : [
        {
          "key" : {
            "dstIP" : "5.135.xx.xx"
          },
          "doc_count" : 1,
          "indexes_per_ip" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "filebeat-name-of-index1-2018.12.08",
                "doc_count" : 1
              }
            ]
          }
        }
      ]
    }
  }
}

The first dstIP mentioned, 5.189.xx.xx, appears in both indexes; the second one is unique to index_1.
Is this how this was meant to work?
I'm surprised that index2 does not have more IP addresses unique to it, since my Unique IP graphs per index show different numbers (see below). Also, it's funny to notice that despite setting the time range from 7pm EST to 6:59pm EST the next day, index_2, which includes network data and likely has a delay, seems to be spilling over into the index of the next day.


(Abdon Pijpelink) #4

The composite aggregation does not return all results at once. It allows you to page through the buckets. The after_key that's returned is something you can use to get the next page of buckets, using the after parameter in a subsequent request:

GET logs_server*/_search
{
  "size": 0,
  "aggs": {
    "all_ips": {
      "composite": {
        "sources": [
          {
            "dstIP": {
              "terms": {
                "field": "geoip.city_name.keyword"
              }
            }
          }
        ],
        "after": { "dstIP" : "5.189.xx.xx" }
      },
      "aggs": {
        "indexes_per_ip": {
          "terms": {
            "field": "_index",
            "size": 2
          }
        },
        "index_count_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "index_count": "indexes_per_ip._bucket_count"
            },
            "script": "params.index_count == 1"
          }
        }
      }
    }
  }
}

To make paging more efficient, you could set the size parameter in the composite aggregation to, for example, 1000.
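The paging loop can be sketched in a few lines of client code. This is a hand-rolled illustration, not any official client API: run_search is a hypothetical stand-in for however you issue the search request, and the stub below only mimics the response shape shown above. A page with fewer buckets than the requested size is the last one.

```python
def iter_composite_buckets(run_search, size=2):
    """Page through a composite aggregation by feeding each page's
    after_key back as the 'after' parameter of the next request."""
    after = None
    while True:
        agg = run_search(after, size)["aggregations"]["all_ips"]
        buckets = agg["buckets"]
        for bucket in buckets:
            yield bucket
        if len(buckets) < size:  # last page: fewer buckets than requested
            return
        after = agg["after_key"]

# Stub serving two pages, mimicking the response shape above
pages = [
    {"aggregations": {"all_ips": {
        "buckets": [{"key": {"dstIP": "5.135.0.1"}},
                    {"key": {"dstIP": "5.135.0.2"}}],
        "after_key": {"dstIP": "5.135.0.2"}}}},
    {"aggregations": {"all_ips": {
        "buckets": [{"key": {"dstIP": "5.189.0.3"}}]}}},
]

def run_search(after, size):
    return pages[0] if after is None else pages[1]

ips = [b["key"]["dstIP"] for b in iter_composite_buckets(run_search)]
print(ips)  # ['5.135.0.1', '5.135.0.2', '5.189.0.3']
```

In real code, run_search would send the JSON request from the previous reply, adding "after" to the composite aggregation when it is not None.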


#5

@abdon, thanks so much. I wonder, is there any way to save this aggregation within Kibana so that I can do visualizations on top of it?


(Abdon Pijpelink) #6

As far as I know, you can't visualize an aggregation like this with regular Kibana visualizations. You can try building a Vega visualization, but those have a bit of a learning curve. This tutorial is a good way to get started.