Data set difference between fields on different indexes


#1

Hi, per my previous post, I ended up creating two separate indexes (Index 1, Index 2) that share one common field, dstIP.

I'd like now, let's say on an hourly basis, to get a list of all the IP addresses that exist in Index 1 but not in Index 2.

What is the best way of accomplishing this? A scripted filter? A DSL filter query?

Can the outcome be visualized for further understanding of where these IPs come from?

Thanks.


(Abdon Pijpelink) #2

I would solve this with aggregations rather than queries. You start with a composite aggregation, to get a list of all unique IPs across the two indexes.

Nested inside that, you use a terms aggregation on the _index meta field to get a bucket for each index in which an IP exists. IPs that occur in both indexes get two buckets; IPs that occur in only one index get one bucket.

Finally, you can use a bucket_selector pipeline aggregation to filter out the IP addresses for which there are two buckets, in other words, to retrieve only the IPs that occur in a single index.

That request would look like this:

GET index1,index2/_search
{
  "size": 0,
  "aggs": {
    "all_ips": {
      "composite": {
        "sources": [
          {
            "dstIP": {
              "terms": {
                "field": "dstIP"
              }
            }
          }
        ]
      },
      "aggs": {
        "indexes_per_ip": {
          "terms": {
            "field": "_index",
            "size": 2
          }
        },
        "index_count_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "index_count": "indexes_per_ip._bucket_count"
            },
            "script": "params.index_count == 1"
          }
        }
      }
    }
  }
}
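Note that the bucket_selector keeps IPs that occur in a single index, whichever index that is. If you specifically want the IPs that are in Index 1 but not Index 2, you can post-filter the response on the client side. A minimal sketch, assuming the response shape produced by the query above (the function name and the sample data are illustrative, not part of any API):

```python
def ips_only_in(agg_response, index_name):
    """Collect the dstIP values whose single remaining index bucket
    belongs to index_name."""
    ips = []
    for bucket in agg_response["aggregations"]["all_ips"]["buckets"]:
        index_buckets = bucket["indexes_per_ip"]["buckets"]
        # The bucket_selector already removed IPs seen in both indexes,
        # so each surviving IP has exactly one index bucket.
        if index_buckets and index_buckets[0]["key"] == index_name:
            ips.append(bucket["key"]["dstIP"])
    return ips

# Shortened sample response, shaped like the query output above
sample = {
    "aggregations": {
        "all_ips": {
            "buckets": [
                {"key": {"dstIP": "10.0.0.1"},
                 "indexes_per_ip": {"buckets": [{"key": "index1", "doc_count": 3}]}},
                {"key": {"dstIP": "10.0.0.2"},
                 "indexes_per_ip": {"buckets": [{"key": "index2", "doc_count": 1}]}},
            ]
        }
    }
}

print(ips_only_in(sample, "index1"))  # ['10.0.0.1']
```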

#3

Thank you @abdon for your reply.

I ran the query, replacing index1 and index2 with the daily indexes for my two data logs.

The result I got was:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 240139,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "all_ips" : {
      "after_key" : {
        "dstIP" : "5.189.xx.xx"
      },
      "buckets" : [
        {
          "key" : {
            "dstIP" : "5.135.xx.xx"
          },
          "doc_count" : 1,
          "indexes_per_ip" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "filebeat-name-of-index1-2018.12.08",
                "doc_count" : 1
              }
            ]
          }
        }
      ]
    }
  }
}

The first dstIP mentioned, 5.189.xx.xx, appears in both indexes; the second one is unique to index_1.
Is this how this was meant to work?
I'm surprised that index2 does not have more IP addresses unique to it, since my Unique IP graphs per index show different numbers (see below). Also, it's funny to notice that despite setting the time range from 7pm EST to 6:59pm EST the next day, index_2, which includes network data and likely has a delay, seems to be spilling over into the index of the next day.


(Abdon Pijpelink) #4

The composite aggregation does not return all results at once. It allows you to page through the buckets. The after_key that's returned is something you can use to get the next page of buckets, using the after parameter in a subsequent request:

GET logs_server*/_search
{
  "size": 0,
  "aggs": {
    "all_ips": {
      "composite": {
        "sources": [
          {
            "dstIP": {
              "terms": {
                "field": "geoip.city_name.keyword"
              }
            }
          }
        ],
        "after": { "dstIP" : "5.189.xx.xx" }
      },
      "aggs": {
        "indexes_per_ip": {
          "terms": {
            "field": "_index",
            "size": 2
          }
        },
        "index_count_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "index_count": "indexes_per_ip._bucket_count"
            },
            "script": "params.index_count == 1"
          }
        }
      }
    }
  }
}

To make paging more efficient, you could set the size parameter in the composite aggregation to, for example, 1000.
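The paging loop can be sketched in a few lines of client code. This is a hand-rolled illustration, not any official client API: run_search is a hypothetical stand-in for however you issue the search request, and the stub below only mimics the response shape shown above. A page with fewer buckets than the requested size is the last one.

```python
def iter_composite_buckets(run_search, size=2):
    """Page through a composite aggregation by feeding each page's
    after_key back as the 'after' parameter of the next request."""
    after = None
    while True:
        agg = run_search(after, size)["aggregations"]["all_ips"]
        buckets = agg["buckets"]
        for bucket in buckets:
            yield bucket
        if len(buckets) < size:  # last page: fewer buckets than requested
            return
        after = agg["after_key"]

# Stub serving two pages, mimicking the response shape above
pages = [
    {"aggregations": {"all_ips": {
        "buckets": [{"key": {"dstIP": "5.135.0.1"}},
                    {"key": {"dstIP": "5.135.0.2"}}],
        "after_key": {"dstIP": "5.135.0.2"}}}},
    {"aggregations": {"all_ips": {
        "buckets": [{"key": {"dstIP": "5.189.0.3"}}]}}},
]

def run_search(after, size):
    return pages[0] if after is None else pages[1]

ips = [b["key"]["dstIP"] for b in iter_composite_buckets(run_search)]
print(ips)  # ['5.135.0.1', '5.135.0.2', '5.189.0.3']
```

In real code, run_search would send the JSON request from the previous reply, adding "after" to the composite aggregation when it is not None.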


#5

@abdon, thanks so much. I wonder, is there any way to save this aggregation within Kibana so that I can do visualizations on top of it?


(Abdon Pijpelink) #6

As far as I know, you can't visualize an aggregation like this with regular Kibana visualizations. You can try building a Vega visualization, but those have a bit of a learning curve. This tutorial is a good way to get started.