Boolean similarity - is there a way to remove duplicates

Given the following index:

{
    "mappings": {
        "properties": {
            "field1": {
                "type": "text",
                "analyzer": "whitespace",
                "similarity": "boolean"
            },
            "field2": {
                "type": "text",
                "analyzer": "whitespace",
                "similarity": "boolean"
            }
        }
    }
}

And the following data in it:

{ "index" : {} }
{ "field1": "foo", "field2": "bar"}
{ "index" : {} }
{ "field1": "foo1 foo2", "field2": "bar1 bar2"}
{ "index" : {} }
{ "field1": "foo1 foo2 foo3", "field2": "bar1 bar2 bar3"}

For the given Boolean query:

{
    "size": 10,
    "min_score": 0.4,
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "fuzzy": {
                                "field1": {
                                    "value": "foo",
                                    "fuzziness": "AUTO",
                                    "boost": 1
                                }
                            }
                        },
                        {
                            "fuzzy": {
                                "field2": {
                                    "value": "bar",
                                    "fuzziness": "AUTO",
                                    "boost": 1
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
}

I always get ["foo1 foo2 foo3", "bar1 bar2 bar3"] as the top hit, even though the index contains an exact match (the first document):

{
    "took": 114,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 3.9999998,
        "hits": [
            {
                "_index": "test_index2",
                "_type": "_doc",
                "_id": "bXw8eXUBCTtfNv84bNPr",
                "_score": 3.9999998,
                "_source": {
                    "field1": "foo1 foo2 foo3",
                    "field2": "bar1 bar2 bar3"
                }
            },
            {
                "_index": "test_index2",
                "_type": "_doc",
                "_id": "bHw8eXUBCTtfNv84bNPr",
                "_score": 2.6666665,
                "_source": {
                    "field1": "foo1 foo2",
                    "field2": "bar1 bar2"
                }
            },
            {
                "_index": "test_index2",
                "_type": "_doc",
                "_id": "a3w8eXUBCTtfNv84bNPr",
                "_score": 2.0,
                "_source": {
                    "field1": "foo",
                    "field2": "bar"
                }
            }
        ]
    }
}

I'm aware that Boolean similarity works this way, rewarding documents that match as many terms as possible, and I know I could rescore here, but that is not an option because I don't know how many top N results to fetch.

Are there any other options? One idea is to create my own similarity plugin, based on Boolean similarity, that removes duplicates and keeps only the best-matching token. I don't know where to start, though: the only samples I can find are for script and rescore.

Thanks in advance.

Maybe I was a bit unclear with my question, so let me put it this way: are there any recent examples of how to implement your own similarity module, possibly based on Boolean similarity?

This is likely because of the term frequency (TF) in the doc. The repetition of the search terms in the document outweighs any exact match with no repetition.

Generally speaking it makes sense to take a user's search string and run it using a combination of strict and fuzzier query clause types. This can be done in roughly two ways:

  1. Prioritising recall - use a single bool query with a should array to combine multiple strict and fuzzier clauses. The docs that match most clauses should rank highest. The problem is there may be a "long tail" of weak matches that show up in any facets or if the user chooses to sort by anything other than relevance e.g. price.
  2. Prioritising precision - issue multiple search requests, starting with the strictest clause and then "falling back" to using fuzzier queries only if there are insufficient results. This may take longer to run but helps ensure results are free of irrelevant weak matches when there's no need for them.
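The "falling back" approach in option 2 can be sketched roughly as follows. This is only an illustration under assumptions: `run_search` is a hypothetical callable standing in for whatever client call executes a request body and returns its hits, `min_hits` is an invented threshold, and the `term` and `fuzzy` clauses simply reuse the field names from the question.

```python
from typing import Callable, List


def strict_query(value1: str, value2: str) -> dict:
    """Strictest request: exact term matches on both fields."""
    return {"query": {"bool": {"should": [
        {"term": {"field1": {"value": value1}}},
        {"term": {"field2": {"value": value2}}},
    ]}}}


def fuzzy_query(value1: str, value2: str) -> dict:
    """Fuzzier request: the same fields, matched with AUTO fuzziness."""
    return {"query": {"bool": {"should": [
        {"fuzzy": {"field1": {"value": value1, "fuzziness": "AUTO"}}},
        {"fuzzy": {"field2": {"value": value2, "fuzziness": "AUTO"}}},
    ]}}}


def search_with_fallback(
    run_search: Callable[[dict], List[dict]],  # hypothetical: body -> list of hits
    queries: List[dict],
    min_hits: int = 1,                         # hypothetical threshold
) -> List[dict]:
    """Run the queries from strictest to fuzziest and stop at the first
    one that returns at least min_hits results."""
    hits: List[dict] = []
    for body in queries:
        hits = run_search(body)
        if len(hits) >= min_hits:
            break
    return hits
```

It would be called as `search_with_fallback(run_search, [strict_query("foo", "bar"), fuzzy_query("foo", "bar")])`, so the fuzzy request is only sent when the strict one finds too little.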

This is likely because of the term frequency (TF) in the doc. The repetition of the search terms in the document outweighs any exact match with no repetition.

But the documentation says that Boolean similarity returns a score that is based on whether the query terms match or not, so I assumed that has nothing to do with TF.
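For what it's worth, the scores in the response above can be reproduced without any TF at all. The sketch below is only a model, under two assumptions that are not confirmed anywhere in this thread: that with Boolean similarity every matching term contributes just its boost, and that each term a fuzzy query expands to is boosted by 1 - edits / min(query length, term length). It also uses plain Levenshtein distance and allows one edit, which is what AUTO gives for a three-character value.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (transpositions are not treated specially)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_field_score(query: str, field_text: str, max_edits: int = 1) -> float:
    """Assumed model: sum the boosts of the distinct whitespace tokens
    that lie within max_edits of the query value."""
    total = 0.0
    for token in set(field_text.split()):
        edits = edit_distance(query, token)
        if edits <= max_edits:
            total += 1.0 - edits / min(len(query), len(token))
    return total


def doc_score(field1: str, field2: str) -> float:
    """Score of the bool/should query from the question under the assumed model."""
    return fuzzy_field_score("foo", field1) + fuzzy_field_score("bar", field2)


docs = [
    ("foo", "bar"),
    ("foo1 foo2", "bar1 bar2"),
    ("foo1 foo2 foo3", "bar1 bar2 bar3"),
]
for field1, field2 in docs:
    print(field1, "|", field2, "->", round(doc_score(field1, field2), 4))
```

Under this model the three documents come out at roughly 2.0, 2.67 and 4.0, which lines up with the reported 2.0, 2.6666665 and 3.9999998. If the model holds, the ranking comes from the number of distinct tokens each fuzzy clause matches, not from how often a single term repeats.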

Thanks for the options mentioned, but is there a way to extend the default Boolean similarity and implement my own similarity that does exactly what I want? Option 1 may still produce a suboptimal ranking, and option 2 can take longer to run.

The problem I'm solving with Elasticsearch is fuzzy matching of postal addresses. Addresses and receivers can be misspelled, there are aliases for addresses and receivers, and so on. The main problem I'm dealing with is interference between primary and alternative town and street names.
