Array intersection metric

Hi everyone,

I use a filter query to select only documents that match certain ids in a list. I would like to get back, as a metric, the number of ids that matched each document, but if I use a cardinality aggregation I get the cardinality of all the ids in the documents. Example:

Let's say that I have 2 documents. Document A has ids = ['000', '111', '222'] and document B has ids = ['111', '222'].

By filtering for ids = ['000'], only document A will match the query, and executing a cardinality aggregation over ids gives a result of 3. I would like to get back only the number of ids matching the document, in this case 1.
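
For reference, the two documents could be indexed roughly like this (the index name my_index and the _doc type are just placeholders here, and ids is assumed to be mapped as a keyword field):

PUT my_index/_doc/A
{ "ids": ["000", "111", "222"] }

PUT my_index/_doc/B
{ "ids": ["111", "222"] }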

I am trying with a scripted metric aggregation, but with no luck: I get back only one value per shard instead of one per document.

Here is a sample of the query I am running:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "ids": [
            "id0",
            "id1"
          ]
        }
      }
    }
  },
  "aggs": {
    "match": {
      "scripted_metric": {
        "init_script": "params._agg.domain = ['id0', 'id1']",
        "map_script": "params._agg.stats = []; Map ref = [:] ; List matches = []; for (int i = 0; i < params._agg.domain.length; ++i) { for (d in doc['ids']) { if (params._agg.domain[i] == d) { frame.put('ref', doc['ref'][0]); ref.put('ids', []); params._agg.stats.add(ref); break;} } } return params._agg.stats;"
      }
    }
  }
}

Does anyone have an idea how I can compute this metric?

You can let Elasticsearch compute this without having to resort to complicated scripting, by making use of scoring.

If, instead of using a terms query, you take each ID and use a term query for that ID, wrapped in a constant_score query, you will get a score of 0 or 1 depending on whether that ID matches a given document. You can then combine all these term queries for all the IDs that you are interested in inside a bool query's should clause, like this:

{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "ids": {
                  "value": "000"
                }
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "ids": {
                  "value": "111"
                }
              }
            }
          }
        }
      ]
    }
  }
}

The score of each document will now be the number of IDs that match for that document.
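
With the two example documents from before, the (abbreviated) hits would then look something like this; document A matches both "000" and "111", document B only "111":

"hits": {
  "max_score": 2.0,
  "hits": [
    { "_id": "A", "_score": 2.0 },
    { "_id": "B", "_score": 1.0 }
  ]
}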

If you want to know the total number of IDs that matched across all documents, you can combine this query with a sum aggregation that sums all of the individual scores:

{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "ids": {
                  "value": "000"
                }
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "ids": {
                  "value": "111"
                }
              }
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "total_hits": {
      "sum": {
        "script": "_score"
      }
    }
  }
}
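
For the two example documents, the aggregation result would be roughly the following (2 matching IDs for document A plus 1 for document B):

"aggregations": {
  "total_hits": {
    "value": 3.0
  }
}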

Hi Abdon,

Thanks for the reply. I think we are close to the solution; I probably did not explain myself very well.

First, let's clear up some names:

  • ref_id is a reference id; there is one per document, but the same ref_id can appear in multiple documents (with different timestamps)
  • target_ids is an array of values on which we want to compute the intersection
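
So a single document looks roughly like this (field values here are purely illustrative):

{
  "ref_id": "A",
  "timestamp": "2018-01-01T00:00:00Z",
  "target_ids": ["id0", "id1", "id2"]
}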

I modified your solution with a nested aggregation to get back the result for each ref_id.

{
  "size": 0, 
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "target_ids": {
                  "value": "id0"
                }
              }
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "target_ids": {
                  "value": "id1"
                }
              }
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "bucket": {
      "terms": {
        "field": "ref_id",
        "size": 10
      },
      "aggs": {
        "total_hits": {
          "sum": {
            "script": "_score"
          }
        }
      }
    }
  }
}

I have multiple documents with the same ref_id but different values for other fields (e.g. timestamp). Because of that, the score that I get back is not the number of unique ids for each reference id, but the sum of all the matches.

So instead of getting back 1 or 2 for each ref_id, I get back the sum of all the matches across its documents: something like how many times the target_ids are seen in total. Instead, I want them to be counted distinctly; once an id has been seen, it should not be counted again.
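
To make this concrete, with two documents like these (illustrative values), the scoring approach gives 2 for ref_id "A", while the distinct count I am after is 1, because "id0" is a single unique id:

{ "ref_id": "A", "timestamp": 1, "target_ids": ["id0"] }
{ "ref_id": "A", "timestamp": 2, "target_ids": ["id0"] }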

I did some more work on the script and finally got it working, but I soon hit the limitation that a script cannot be more than 16k characters long, and I want to have lists of 10k or more target_ids.

I was thinking about another solution: what if I compute the intersection later and get back from the aggregation the unique list of target_ids per ref_id bucket? I can make the inner aggregation a terms aggregation and get back a dictionary with the target_ids, but it will be restricted by the size I choose.

"aggs": {
    "bucket": {
      "terms": {
        "field": "ref_id",
        "size": 5
      },
      "aggs": {
        "term": {
          "terms": {
            "field": "target_ids",
            "size": 10
          }
        }
      }
    }
  }

Any idea?

Actually, after a chat on IRC, I found a solution that does almost what I want, by running this query:

{
  "size": 0, 
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "target_ids": [
            "000",
            "111",
            "222"
          ]
        }
      }
    }
  },
  "aggs": {
    "bucket": {
      "terms": {
        "field": "ref_id",
        "size": 10
      },
      "aggs": {
        "metric": {
          "terms": {
            "field": "target_ids",
            "include": ["000", "111", "222"], 
            "size": 3
          }
        }
      }
    }
  }
}
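
Each ref_id bucket in the response then contains a nested target_ids terms aggregation, roughly like this (abbreviated, and assuming the document with ids 000/111/222 has ref_id "A"); the number of keys in metric is the value I actually want:

"bucket": {
  "buckets": [
    {
      "key": "A",
      "doc_count": 1,
      "metric": {
        "buckets": [
          { "key": "000", "doc_count": 1 },
          { "key": "111", "doc_count": 1 },
          { "key": "222", "doc_count": 1 }
        ]
      }
    }
  ]
}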

I am only interested in the actual size of the target_ids terms aggregation. I understand that doing this from a custom application is super easy (just count the keys in the bucket), but I am trying to plug this into Kibana to get back a sort of cardinality count, restricted to the values I passed in the filter and include.
Is there any way to get back from Elasticsearch the size of the target_ids terms aggregation instead of the terms aggregation results?

I actually created a new metric in Kibana to run the restricted unique count on the passed filter, but it seems that Kibana can only handle metrics that return a single value per bucket, not another aggregation. Another solution would be for Kibana to handle responses that are not just a single value but a more complex JSON.
But this is getting out of the main scope of my question.

I opened a feature request on the Elasticsearch repo about this.

The feature request to solve this problem was rejected.

I found a workaround using this query:

{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "ref_id",
        "size": 5,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "1": {
          "bucket_script": {
            "buckets_path": {
              "bucket_size": "restricted_terms._bucket_count"
            },
            "script": "params.bucket_size"
          }
        },
        "restricted_terms": {
          "terms": {
            "field": "target_ids",
            "include": [
              "000",
              "111",
              "222"
            ],
            "size": 3,
            "order": {
              "_term": "desc"
            }
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        },
        {
          "bool": {
            "filter": {
              "terms": {
                "target_ids": [
                  "000",
                  "111",
                  "222"
                ]
              }
            }
          }
        }
      ]
    }
  }
}
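
Each ref_id bucket in the response then carries the restricted unique count as a single-value metric under "1", roughly like this (abbreviated):

"2": {
  "buckets": [
    {
      "key": "A",
      "doc_count": 1,
      "restricted_terms": {
        "buckets": [
          { "key": "222", "doc_count": 1 },
          { "key": "111", "doc_count": 1 },
          { "key": "000", "doc_count": 1 }
        ]
      },
      "1": { "value": 3 }
    }
  ]
}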

However, the results cannot be sorted by _bucket_count, which has the side effect that a ref_id with a higher _bucket_count can be cut out by a ref_id with a higher doc_count.

I also tried scripted_metric; although I was able to achieve the functionality I wanted, the script length limitation of 16384 characters does not allow me to make this work in a real use case.

Does anyone have a better idea how to solve this problem?
