Does Elasticsearch score different length shingles with the same IDF?

polyfractal · April 6, 2018, 3:38pm

Yep, this is mostly expected.

It's not really the shingles causing the scoring oddness; what you're seeing is the frequency blending that a SynonymQuery does. It uses the document frequency of the original token for all the subsequent 'synonym' tokens, to avoid skewing the scores. Synonyms are often relatively rare and would drastically affect the scoring if each one used its own df.

From the Lucene docs:

For scoring purposes, this query tries to score the terms as if you had indexed them as one term: it will match any of the terms but only invoke the similarity a single time, scoring the sum of all term frequencies for the document.

The SynonymQuery also sets the docFrequency to the maximum docFrequency among the terms it treats as synonyms. So for example, if:

"deprecation"_df == 5
"deprecation tax"_df == 2
"deprecation taffy"_df == 1

it will use 5 as the docFrequency for scoring purposes.
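
To put rough numbers on that, here is a minimal sketch that compares the blended IDF with what each term would get on its own. It assumes a BM25-style IDF formula and a made-up index size (`DOC_COUNT = 100`); only the three df values come from the example above, and the function name is just for illustration.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style inverse document frequency (assumed formula for this sketch)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index size; the thread does not state one.
DOC_COUNT = 100

# Document frequencies from the example above.
doc_freqs = {
    "deprecation": 5,
    "deprecation tax": 2,
    "deprecation taffy": 1,
}

# Blending: every term in the synonym group is scored with the largest df.
blended_df = max(doc_freqs.values())
blended_idf = bm25_idf(blended_df, DOC_COUNT)

# What each term would have received if scored with its own df.
individual_idf = {term: bm25_idf(df, DOC_COUNT) for term, df in doc_freqs.items()}

for term, idf in individual_idf.items():
    print(f"{term!r}: own idf={idf:.3f}, blended idf={blended_idf:.3f}")
```

The rarer shingles would score noticeably higher on their own, which is why all three ending up with the same IDF looks odd if you expected per-shingle statistics.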

The bigger issue is that Lucene doesn't have a way to differentiate shingles from synonyms: both produce tokens that overlap the position of other tokens in the token stream. So if unigrams are mixed with bigrams (or larger n-grams), Lucene is tricked into thinking it's actually a synonym situation.
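
A toy model of the overlap, in case it helps. This is not Lucene's analysis chain, just a sketch that assumes each shingle is emitted at the position of its first word, the way a stacked synonym token would be. The helper names are made up for illustration.

```python
from collections import defaultdict


def shingle_stream(words, max_size=2, output_unigrams=True):
    """Return (position, token) pairs; a shingle sits at the position of its first word."""
    min_size = 1 if output_unigrams else 2
    tokens = []
    for pos in range(len(words)):
        for size in range(min_size, max_size + 1):
            if pos + size <= len(words):
                tokens.append((pos, " ".join(words[pos:pos + size])))
    return tokens


def overlapping_positions(tokens):
    """Positions holding more than one token, i.e. what looks like a synonym group."""
    by_position = defaultdict(list)
    for pos, token in tokens:
        by_position[pos].append(token)
    return {pos: toks for pos, toks in by_position.items() if len(toks) > 1}


words = ["deprecation", "tax"]

# Unigrams and bigrams in one stream: position 0 holds two tokens.
mixed = shingle_stream(words, output_unigrams=True)
print(overlapping_positions(mixed))

# Bigrams only: nothing shares a position any more.
bigrams_only = shingle_stream(words, output_unigrams=False)
print(overlapping_positions(bigrams_only))
```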

The fix is to keep your unigrams and your bigrams (or larger shingles) in different fields. That way Lucene won't attempt to use SynonymQueries in these situations, because the positions won't be overlapping anymore.
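
Something along these lines should do it. Treat it as a sketch, not a drop-in config: the field names (`body`, `body.shingles`), the analyzer and filter names, and the query text are all made up, and the exact mapping layout depends on your Elasticsearch version. It is written as Python dicts that you would serialize and send as the index-creation and search request bodies. Shingles are bigram-only here to keep the example simple.

```python
import json

# Index settings and mappings: unigrams in `body`, shingles in the `body.shingles` sub-field.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical filter name; `shingle` is the token filter type.
                "bigram_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,  # keep unigrams out of this field
                }
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "shingle_only": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "bigram_shingles"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",  # plain unigrams
                "fields": {
                    "shingles": {"type": "text", "analyzer": "shingle_only"}
                },
            }
        }
    },
}

# Query both fields; each field is scored with its own term statistics.
search_body = {
    "query": {
        "multi_match": {
            "query": "deprecation tax",
            "type": "most_fields",
            "fields": ["body", "body.shingles"],
        }
    }
}

print(json.dumps(index_body, indent=2))
print(json.dumps(search_body, indent=2))
```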
