I have an index, say attributes, whose documents all have a field, say items, which is an array of strings. I want to take an array of strings and write an Elasticsearch query that returns all documents in attributes whose items have a high enough Jaccard similarity with the array I passed, along with the Jaccard similarity score itself.
I've managed to write a fairly contrived terms_set query to do the filtering; I just needed to add an items_count field at index time that stores the length of items:
"query": {
  "terms_set": {
    "items": {
      "terms": ["item1", "item2", "item3"],
      "minimum_should_match_script": {
        "source": "params['thresh'] * (params.num_terms + doc['items_count'].value) / (1 + params['thresh'])",
        "params": {
          "thresh": 0.3
        }
      }
    }
  }
}
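For reference, the expression in the script comes from rearranging the Jaccard definition. With m matching terms, n query terms (params.num_terms) and c items in the document (items_count), the similarity is m / (n + c - m), so requiring it to be at least thresh is the same as requiring m >= thresh * (n + c) / (1 + thresh). Below is a minimal Python sketch of that relationship. It runs outside Elasticsearch, the function names are hypothetical, and it assumes both arrays contain no duplicates.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def min_matches(num_terms, items_count, thresh):
    """Same expression as the minimum_should_match_script above.

    Derived from m / (num_terms + items_count - m) >= thresh,
    assuming neither array contains duplicate strings.
    """
    return thresh * (num_terms + items_count) / (1 + thresh)


# Hypothetical data: a query array and one document's items.
query_terms = ["item1", "item2", "item3"]
doc_items = ["item2", "item3", "item4", "item5"]

matches = len(set(query_terms) & set(doc_items))
needed = min_matches(len(query_terms), len(doc_items), 0.3)

# The match-count test and the direct similarity test agree here.
print(matches >= needed, jaccard(query_terms, doc_items) >= 0.3)
```

The same rearrangement also shows that the similarity can be recovered from the match count alone, as m / (n + c - m), once n and c are known.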
One can change thresh to vary the similarity threshold to filter by. However, this doesn't give me back the Jaccard similarity of each document. Is there a way to get it, or is there a better way of computing Jaccard similarities in general, maybe with a Painless script?