Dealing with highly scored sequences of words

ivanibash · January 21, 2020, 5:18pm

I have an Elasticsearch token filter which, for the sake of argument, looks like this:

"french_company_synonyms": {
    "expand": "true",
    "type": "synonym_graph",
    "synonyms": [
        "llp, limited liability partnership",
        "llc, limited liability company",
        "plc, public limited company",
        "sarl, societe a responsabilite limitee",
        "sa, societe anonyme"
    ]
}

As a result, when someone types in "sarl", the query expands to "societe a responsabilite limitee". So far so good.
Now, my index has records that use both "societe a responsabilite limitee" and "sarl". Let's say I'm looking for "sarl dream", and there is a record stored exactly like that. Even though there is an exact match, the search first returns LOTS of "Societe a responsabilite limitee *" companies. I suspect this is because it expands the query, sees that those records match 4 words, and scores them higher than the exact match "sarl dream" (where only 2 words match).

If I'm understanding the problem correctly, I'd formalise it like this: I have many records in the index with the same combination of words ("societe a responsabilite limitee"). Usually ES is good at penalising words that appear often in the index through TF/IDF. In this case, however, that seems to be offset by the fact that the match is a multi-word phrase: even though each word is common, the matches still score high. How do you deal with cases where it's not a single word that's very common in the index, but a phrase or word sequence?

I see 2 potential solutions. The first is some kind of downscoring, where I demote the matches that contain the phrase "societe a responsabilite limitee" (I think I need a boosting query for that). This is relatively easy but seems a bit dirty, and I think the boost value would need to change as the index grows. The second is to ensure that my index doesn't contain any "societe a responsabilite limitee" at all, with every occurrence normalised to the single word "sarl".
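To make the two options concrete, here is a rough sketch of the request bodies, written as plain Python dicts with no client calls. The field name `name`, the `negative_boost` of 0.2, and the `standard` analyzer override are assumptions for illustration, not values from my real mapping. The analyzer override is there because I'd expect the negative clause to otherwise go through the same synonym expansion and end up matching "sarl" records too.

```python
# Sketch only: builds Elasticsearch request bodies as dicts.
# "name", 0.2 and the "standard" analyzer are hypothetical choices.
LONG_FORM = "societe a responsabilite limitee"


def downscored_query(text, field="name", negative_boost=0.2):
    """Option 1: boosting query that demotes hits containing the long form."""
    return {
        "query": {
            "boosting": {
                # Normal match, still subject to the synonym expansion.
                "positive": {"match": {field: text}},
                # Analysed without synonyms so that only records that
                # literally contain the long form are demoted.
                "negative": {
                    "match_phrase": {
                        field: {"query": LONG_FORM, "analyzer": "standard"}
                    }
                },
                # Matching hits keep this fraction of their score.
                "negative_boost": negative_boost,
            }
        }
    }


def normalising_filter():
    """Option 2: index-time filter that contracts long forms to abbreviations."""
    return {
        "type": "synonym",
        "synonyms": [
            "limited liability partnership => llp",
            "limited liability company => llc",
            "public limited company => plc",
            "societe a responsabilite limitee => sarl",
            "societe anonyme => sa",
        ],
    }
```

With the second option the existing records would have to be reindexed, and the search side would presumably need the same contraction so that a query for the long form still finds "sarl" records.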

Before I go down the rabbit hole of trying things out, can someone tell me if they encountered a similar problem, and also whether my problem definition is even correct?

Thank you.

system · February 18, 2020, 5:18pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
