Jaccard similarities

vsaraph · July 21, 2022, 4:58pm

I have an index, say attributes, whose documents all have a field, say items, that is an array of strings. I want to take an array of strings and write an Elasticsearch query that returns all documents in attributes whose items have a high enough Jaccard similarity with the array I pass in, along with the Jaccard similarity score itself.

I've managed to write a fairly contrived terms_set query that does the filtering. It only required adding an items_count field at index time that stores the length of items:

"query": {
    "terms_set": {
      "items": {
        "terms": ["item1", "item2", "item3"],
        "minimum_should_match_script": {
          "source": "params['thresh'] * (params.num_terms + doc['items_count'].value) / (1 + params['thresh'])",
          "params": {
            "thresh": 0.3
          }
        }
      }
    }
  }
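For reference, the expression in minimum_should_match_script can be derived from the definition of Jaccard similarity. The symbols below are introduced only for this note: A is the set of query terms (its size is params.num_terms), B is the set of a document's items (its size is items_count), m is the number of matching terms, and t is thresh. The derivation assumes neither array contains duplicates.

```latex
\begin{align*}
J(A, B) &= \frac{|A \cap B|}{|A \cup B|} = \frac{m}{|A| + |B| - m} \\
J(A, B) \ge t
  &\iff m \ge t\,\bigl(|A| + |B| - m\bigr) \\
  &\iff m\,(1 + t) \ge t\,\bigl(|A| + |B|\bigr) \\
  &\iff m \ge \frac{t\,\bigl(|A| + |B|\bigr)}{1 + t}
\end{align*}
```

Since m is a whole number while the right-hand side usually is not, it is probably worth checking how Elasticsearch rounds a fractional script result, because that decides whether borderline documents are kept.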

Changing thresh varies the similarity threshold used for filtering. This doesn't return the Jaccard similarity of each document, though. Is there a way to do that, or is there a better way of computing Jaccard similarities in general, perhaps with a Painless script?
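One possible direction, as an untested sketch rather than a confirmed answer: wrap the terms_set filter in a script_score query and recompute the similarity in Painless, so that _score becomes the Jaccard value. This assumes items is mapped as a keyword field with doc values enabled, and that the query terms are passed a second time as a script parameter. The parameter name terms and the variables inter and unionSize are made up for this illustration.

```json
{
  "query": {
    "script_score": {
      "query": {
        "terms_set": {
          "items": {
            "terms": ["item1", "item2", "item3"],
            "minimum_should_match_script": {
              "source": "params['thresh'] * (params.num_terms + doc['items_count'].value) / (1 + params['thresh'])",
              "params": {
                "thresh": 0.3
              }
            }
          }
        }
      },
      "script": {
        "source": "/* sketch: assumes items is a keyword field with doc values */ int inter = 0; for (def t : params.terms) { if (doc['items'].contains(t)) { inter++; } } int unionSize = params.terms.size() + doc['items'].size() - inter; return unionSize == 0 ? 0.0 : (double) inter / unionSize;",
        "params": {
          "terms": ["item1", "item2", "item3"]
        }
      }
    }
  }
}
```

With this shape the inner terms_set query still does the filtering, and the script only runs on documents that pass it. The membership check is a linear scan over each document's values, so it may get slow for very large items arrays.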
