Jaccard similarities

I have an index, say attributes, whose documents all have a field, say items, that is an array of strings. I want to take an array of strings and write an Elasticsearch query that returns every document in attributes whose items have a high enough Jaccard similarity with the array I passed, along with the Jaccard similarity score itself.

I've managed to write a fairly contrived terms_set query that does the filtering. I only needed to add an items_count field at index time that stores the length of items:

"query": {
    "terms_set": {
      "items": {
        "terms": ["item1", "item2", "item3"],
        "minimum_should_match_script": {
          "source": "params['thresh'] * (params.num_terms + doc['items_count'].value) / (1 + params['thresh'])",
          "params": {
            "thresh": 0.3
          }
        }
      }
    }
  }
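
The threshold in the script comes from rearranging the Jaccard inequality: with m matching terms, n query terms (params.num_terms) and k stored items (items_count), m / (n + k - m) >= thresh is equivalent to m >= thresh * (n + k) / (1 + thresh). A minimal Python sketch to sanity-check this offline, assuming both arrays contain no duplicates; the function names jaccard and min_matches are illustrative only and not part of any Elasticsearch API:

```python
from fractions import Fraction
from math import ceil


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of strings."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def min_matches(thresh, num_terms, items_count):
    """Smallest integer m with m / (num_terms + items_count - m) >= thresh.

    Fraction(str(thresh)) keeps the arithmetic exact, so the rounding up
    is not disturbed by floating-point error.
    """
    t = Fraction(str(thresh))
    return ceil(t * (num_terms + items_count) / (1 + t))


# 3 query terms against a document with 3 items, thresh = 0.3:
# 0.3 * 6 / 1.3 is about 1.38, so at least 2 terms have to match.
print(min_matches(0.3, 3, 3))
print(jaccard(["item1", "item2", "item3"], ["item1", "item2", "other"]))
```

The bound is usually fractional while the number of matches is an integer. If the script result is truncated rather than rounded up (an assumption worth checking against your Elasticsearch version), documents slightly below thresh could pass the filter; rounding the value up inside the script would avoid that.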

One can change thresh to vary the similarity threshold used for filtering. However, this doesn't return the Jaccard similarity of each document. Is there a way to get it, or is there a better way of computing Jaccard similarities in general, maybe with a Painless script?
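
One possible direction, as an untested sketch rather than a confirmed answer: wrap the terms_set filter in a script_score query, so that a Painless script counts the overlap and returns the Jaccard similarity as _score. The Python below only builds the request body. It assumes items is a keyword field with doc values (so doc['items'] holds the distinct values) and that items_count equals the number of distinct items; the helper name jaccard_query and both script bodies are illustrative and have not been run against a cluster.

```python
# Filter script: the same threshold expression as in the terms_set query above.
FILTER_SOURCE = (
    "params['thresh'] * (params.num_terms + doc['items_count'].value)"
    " / (1 + params['thresh'])"
)

# Scoring script (assumption: 'items' is a keyword field with doc values).
# Counts the query terms present in the document, then divides by the union size.
SCORE_SOURCE = (
    "int inter = 0; "
    "for (def t : params.terms) { if (doc['items'].contains(t)) { inter++; } } "
    "return (double) inter / (params.terms.size() + doc['items'].size() - inter);"
)


def jaccard_query(terms, thresh):
    """Build a search body that filters by Jaccard threshold and scores by Jaccard."""
    terms = sorted(set(terms))  # drop duplicates so the term count is a set size
    filter_query = {
        "terms_set": {
            "items": {
                "terms": terms,
                "minimum_should_match_script": {
                    "source": FILTER_SOURCE,
                    "params": {"thresh": thresh},
                },
            }
        }
    }
    return {
        "query": {
            "script_score": {
                "query": filter_query,
                "script": {"source": SCORE_SOURCE, "params": {"terms": terms}},
            }
        }
    }


body = jaccard_query(["item1", "item2", "item3"], 0.3)
```

The resulting body could be sent as the search request against attributes. Because the scoring script only runs on documents that already passed the terms_set filter, every scored document has at least one matching term, which keeps the denominator positive whenever thresh is above zero.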
