How to Execute Query to fetch Intersection of Two DataSets


(Kpcool) #1

I have an index 'analytics' that contains a list of events (e.g. CRUD operations) that occurred over a period of time. I am looking for the set of records that were both added and deleted, identified by primary key.

document structure:

id, key, event, timestamp

where key is the primary key of the record and event is one of 'create', 'read', 'delete', 'update'.

I want to find the list of primary keys that have both a 'create' and a 'delete' event. Basically the intersection of two sets (keys with 'create' and keys with 'delete') over the primary key.

I can't seem to get ahead with this.


(Shaunak Kashyap) #2

You could try a query with aggregations like this:

{
  "size": 0, 
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "event": [
            "create",
            "delete"
          ]
        }
      }
    }
  },
  "aggs": {
    "same_key": {
      "terms": {
        "field": "key",
        "min_doc_count": 2, 
        "size": 100
      }
    }
  }
} 

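This query first filters the documents down to create and delete events only. It then aggregates those documents by key. You want only the keys that have both events, hence the min_doc_count value of 2. Note that this relies on each key having at most one create and one delete event; a key with two create events and no delete would also reach the threshold.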

You may want to tweak the size in the terms aggregation (set to 100 above) per your needs.

BTW, the above syntax works for Elasticsearch 2.x. For older versions of Elasticsearch, you will need to use the filtered query instead of the bool query but everything else will remain the same.
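
If a key can have more than one create or delete event, a doc count of 2 no longer proves that both event types are present. One possible variant, offered as a sketch rather than a tested answer, counts each event type per key with filter sub-aggregations and keeps only the buckets where both counts are non-zero, using a bucket_selector pipeline aggregation. The aggregation names created, deleted and both_events and the variable names in buckets_path are made up for illustration. The params. prefix in the script is Painless syntax (Elasticsearch 5.x and later); on 2.x the variables are referenced directly (created_count > 0 && deleted_count > 0) and inline scripting has to be enabled.

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "event": [
            "create",
            "delete"
          ]
        }
      }
    }
  },
  "aggs": {
    "same_key": {
      "terms": {
        "field": "key",
        "size": 100
      },
      "aggs": {
        "created": {
          "filter": { "term": { "event": "create" } }
        },
        "deleted": {
          "filter": { "term": { "event": "delete" } }
        },
        "both_events": {
          "bucket_selector": {
            "buckets_path": {
              "created_count": "created>_count",
              "deleted_count": "deleted>_count"
            },
            "script": "params.created_count > 0 && params.deleted_count > 0"
          }
        }
      }
    }
  }
}
```

The bucket_selector is applied to the buckets the terms aggregation has already returned, so with size set to 100 you may get back fewer than 100 keys; increase size if keys appear to be missing.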


(Raj) #3

Generally I use |A ∩ B| = |A| + |B| - |A ∪ B|, but you have to be careful if you take the counts from cardinality aggregations: those counts are approximate, so the result can come out negative.
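
As a rough sketch of that approach, assuming the same key and event fields as in the question, the three counts could be requested in one search with a cardinality aggregation under each of three filter aggregations. The names created, deleted, either and keys are invented for this example.

```json
{
  "size": 0,
  "aggs": {
    "created": {
      "filter": { "term": { "event": "create" } },
      "aggs": {
        "keys": { "cardinality": { "field": "key" } }
      }
    },
    "deleted": {
      "filter": { "term": { "event": "delete" } },
      "aggs": {
        "keys": { "cardinality": { "field": "key" } }
      }
    },
    "either": {
      "filter": { "terms": { "event": ["create", "delete"] } },
      "aggs": {
        "keys": { "cardinality": { "field": "key" } }
      }
    }
  }
}
```

The client then computes created.keys.value + deleted.keys.value - either.keys.value and clamps the result at 0. This yields only the size of the intersection, not the list of keys. Raising precision_threshold on the cardinality aggregations should reduce the error, at the cost of more memory.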

