Boolean similarity - is there a way to remove duplicates

Miljenko_Norsic · October 30, 2020, 11:44am

Given the following index:

{
    "mappings": {
        "properties": {
            "field1": {
                "type": "text",
                "analyzer": "whitespace",
                "similarity": "boolean"
            },
            "field2": {
                "type": "text",
                "analyzer": "whitespace",
                "similarity": "boolean"
            }
        }
    }
}

And the following data in it:

{ "index" : {} }
{ "field1": "foo", "field2": "bar"}
{ "index" : {} }
{ "field1": "foo1 foo2", "field2": "bar1 bar2"}
{ "index" : {} }
{ "field1": "foo1 foo2 foo3", "field2": "bar1 bar2 bar3"}

For the given Boolean query:

{
    "size": 10,
    "min_score": 0.4,
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "fuzzy": {
                                "field1": {
                                    "value": "foo",
                                    "fuzziness": "AUTO",
                                    "boost": 1
                                }
                            }
                        },
                        {
                            "fuzzy": {
                                "field2": {
                                    "value": "bar",
                                    "fuzziness": "AUTO",
                                    "boost": 1
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
}

I always get ["foo1 foo2 foo3", "bar1 bar2 bar3"] as the top hit, even though the index contains an exact match (the first document):

{
    "took": 114,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 3.9999998,
        "hits": [
            {
                "_index": "test_index2",
                "_type": "_doc",
                "_id": "bXw8eXUBCTtfNv84bNPr",
                "_score": 3.9999998,
                "_source": {
                    "field1": "foo1 foo2 foo3",
                    "field2": "bar1 bar2 bar3"
                }
            },
            {
                "_index": "test_index2",
                "_type": "_doc",
                "_id": "bHw8eXUBCTtfNv84bNPr",
                "_score": 2.6666665,
                "_source": {
                    "field1": "foo1 foo2",
                    "field2": "bar1 bar2"
                }
            },
            {
                "_index": "test_index2",
                "_type": "_doc",
                "_id": "a3w8eXUBCTtfNv84bNPr",
                "_score": 2.0,
                "_source": {
                    "field1": "foo",
                    "field2": "bar"
                }
            }
        ]
    }
}

I'm aware that Boolean similarity works this way, favouring documents with as many matches as possible, and I know I could use rescoring here, but that is not an option because I don't know how many top N results to fetch.

Are there any other options? One idea is to write my own similarity plugin, based on Boolean similarity, that removes the duplicates and keeps only the best-matching token. I don't know where to start, though: the only samples I can find are for script and rescore.

Thanks in advance.

Miljenko_Norsic · November 9, 2020, 7:34am

Maybe my question was a bit unclear, so let me put it this way: are there any recent examples of how to implement your own similarity module, possibly based on Boolean similarity?

Mark_Harwood · November 9, 2020, 9:48am

This is likely because of the term frequency (TF) in the doc. The repetition of the search terms in the document outweighs any exact matches with no repetition.

Generally speaking it makes sense to take a user's search string and run it using a combination of strict and fuzzier query clause types. This can be done in roughly two ways:

1. Prioritising recall - use a single bool query with a should array to combine multiple strict and fuzzier clauses. The docs that match the most clauses should rank highest. The problem is that there may be a "long tail" of weak matches that show up in any facets, or if the user chooses to sort by anything other than relevance, e.g. price.
2. Prioritising precision - issue multiple search requests, starting with the strictest clause and then "falling back" to fuzzier queries only if there are insufficient results. This may take longer to run but helps ensure results are free of irrelevant weak matches when there's no need for them.
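
The precision-first approach could be sketched on the client side roughly as follows. This is only an illustration under assumptions: `search` is a hypothetical callable that sends a request body to Elasticsearch and returns the list of hits (it is not an API of any client library), `min_hits` is an assumed threshold, and the field names are the ones from the question.

```python
from typing import Callable, Dict, List

Body = Dict[str, object]


def strict_body(fields: Dict[str, str]) -> Body:
    """Strictest request: exact term clauses only."""
    return {"query": {"bool": {"should": [
        {"term": {name: {"value": value}}} for name, value in fields.items()
    ]}}}


def fuzzy_body(fields: Dict[str, str]) -> Body:
    """Fallback request: the fuzzy clauses used in the question."""
    return {"query": {"bool": {"should": [
        {"fuzzy": {name: {"value": value, "fuzziness": "AUTO"}}}
        for name, value in fields.items()
    ]}}}


def search_with_fallback(
    search: Callable[[Body], List[dict]],  # hypothetical: body in, hits out
    fields: Dict[str, str],
    min_hits: int = 1,                     # assumed threshold for "enough"
) -> List[dict]:
    """Try the strict request first; go fuzzy only if it returns too little."""
    hits: List[dict] = []
    for body in (strict_body(fields), fuzzy_body(fields)):
        hits = search(body)
        if len(hits) >= min_hits:
            break
    return hits
```

Called as `search_with_fallback(my_search, {"field1": "foo", "field2": "bar"})`, the fuzzy request is only sent when the strict one returns fewer than `min_hits` hits, which is where the extra run time comes from.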

Miljenko_Norsic · November 9, 2020, 1:42pm

Mark_Harwood wrote: "This is likely because of the term frequency (TF) in the doc. The repetition of the search terms in the document outweighs any exact matches with no repetition."

But the documentation says that Boolean similarity returns a score that is based on whether the query terms match or not, so I assumed that has nothing to do with TF.
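
For what it's worth, the scores in the response above can be reproduced without any term-frequency component. The sketch below rests on assumptions that are not confirmed anywhere in this thread: that the fuzzy query expands into one clause per indexed term within the allowed edit distance, that each such clause is weighted by 1 - edits / min(len(query), len(term)), and that the clause scores are summed. All function names are hypothetical, and plain Levenshtein distance is used (transpositions are not handled, which does not matter for these terms). The last function shows the "keep only the best-matching token" scoring asked for in the original question, for contrast.

```python
# Sketch under stated assumptions; not the actual Elasticsearch/Lucene code.

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def max_edits_auto(query: str) -> int:
    """fuzziness AUTO: 0 edits for 1-2 chars, 1 for 3-5 chars, 2 above that."""
    if len(query) <= 2:
        return 0
    return 1 if len(query) <= 5 else 2


def term_weights(query: str, text: str) -> list:
    """Assumed weight of every token of `text` that fuzzily matches `query`."""
    weights = []
    for term in text.split():  # whitespace analyzer: split on whitespace only
        edits = edit_distance(query, term)
        if edits <= max_edits_auto(query):
            weights.append(1.0 - edits / min(len(query), len(term)))
    return weights


def summed_score(doc: dict, query: dict) -> float:
    """Assumed current behaviour: every matching token adds its weight."""
    return sum(sum(term_weights(q, doc[f])) for f, q in query.items())


def best_token_score(doc: dict, query: dict) -> float:
    """Desired behaviour: only the best-matching token per field counts."""
    return sum(max(term_weights(q, doc[f]), default=0.0) for f, q in query.items())


docs = [
    {"field1": "foo", "field2": "bar"},
    {"field1": "foo1 foo2", "field2": "bar1 bar2"},
    {"field1": "foo1 foo2 foo3", "field2": "bar1 bar2 bar3"},
]
query = {"field1": "foo", "field2": "bar"}

for doc in docs:
    print(doc["field1"],
          round(summed_score(doc, query), 4),
          round(best_token_score(doc, query), 4))
```

Under these assumptions the summed scores come out as 2.0, 2.6667 and 4.0, which lines up with the `_score` values in the response, while the best-token variant gives 2.0 for the exact document and 1.3333 for the other two. If the assumptions hold, the ranking comes from summing several fuzzy matches per field rather than from TF.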

Thanks for the options mentioned, but is there a way to extend the default Boolean similarity and implement my own similarity that does exactly what I want? Option 1 may still produce a suboptimal result, and option 2 can take longer to run.

The problem I'm solving with Elasticsearch is fuzzy matching of postal addresses, where addresses and receivers can be misspelled, there are aliases for addresses and receivers, and so on. The main problem I'm dealing with is interference between primary and alternative town and street names.

system · December 7, 2020, 1:42pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
