Does Elasticsearch score different length shingles with the same IDF?

Brett_Anderson · April 4, 2018, 1:11am

In Elasticsearch 5.6.5 I'm searching against a field with the following filter applied:

    "filter_shingle": {
       "max_shingle_size": "4",
       "min_shingle_size": "2",
       "output_unigrams": "true",
       "type": "shingle"
    }

When I search for "depreciation tax" against a document containing that exact text, I see the following explanation of the score:

    weight(Synonym(content:depreciation content:depreciation tax)) .... [7.65]
    weight(content:tax) ... [6.02]

If I change the search to "depreciation taffy" against the exact same document, which has "depreciation tax" in its content, I get this explanation:

    weight(Synonym(content:depreciation content:depreciation taffy)) .... [7.64]

This is not what I expected. I thought a match on the bigram token "depreciation tax" would get a much higher score than a match on the unigram "depreciation". However, this scoring seems to reflect a simple unigram match. There is an extremely small difference, and digging further, it comes from termFreq=28 under the "depreciation taffy" match versus termFreq=29 under the "depreciation tax" match. I'm also not sure how this relates to IDF, as I imagine that across the shard holding this document there are very different counts for "depreciation", "depreciation tax" and "depreciation taffy".

Is this expected behavior? Is ES treating all the different sized shingles, including unigrams, with the same IDF value? Do I need to split out each shingle size into different sub fields with different analyzers to get the behavior I expect?

polyfractal · April 6, 2018, 3:38pm

Yep, this is mostly expected.

It's not really the shingles causing the scoring oddness, but the fact that SynonymQueries do the frequency blending that you're seeing. They use the frequency of the original token for all the subsequent 'synonym' tokens, as a way to avoid skewing the scores. Synonyms are often relatively rare, and would drastically affect the scoring if each one used its own df.

From the Lucene docs:

For scoring purposes, this query tries to score the terms as if you had indexed them as one term: it will match any of the terms but only invoke the similarity a single time, scoring the sum of all term frequencies for the document.

The SynonymQuery also sets the docFrequency to the maximum docFrequency of the terms in the query. So for example, if:

"depreciation"_df == 5,
"depreciation tax"_df == 2,
"depreciation taffy"_df == 1,

it will use 5 as the docFrequency for scoring purposes.
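
To make the blending concrete, here is a rough Python sketch using the document frequencies from the example above. It is not Lucene's code: the idf formula is the usual BM25 form, the total document count of 1000 is made up, and the per-term frequencies (28 + 1) are only an inference that happens to reproduce the termFreq=29 and termFreq=28 values reported in the question.

```python
import math

DOC_COUNT = 1000  # hypothetical number of documents in the shard

# Document frequencies taken from the example in this reply.
DOC_FREQS = {
    "depreciation": 5,
    "depreciation tax": 2,
    "depreciation taffy": 1,
}

# Assumed per-term frequencies inside the one matching document.
TERM_FREQS_IN_DOC = {
    "depreciation": 28,
    "depreciation tax": 1,
    "depreciation taffy": 0,
}


def bm25_idf(doc_freq, doc_count=DOC_COUNT):
    """BM25-style idf: rarer terms get a larger value."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def blended_stats(terms):
    """Mimic the blending described above: max df, summed term frequency."""
    doc_freq = max(DOC_FREQS.get(t, 0) for t in terms)
    term_freq = sum(TERM_FREQS_IN_DOC.get(t, 0) for t in terms)
    return doc_freq, term_freq


for query_terms in (
    ["depreciation", "depreciation tax"],
    ["depreciation", "depreciation taffy"],
):
    df, tf = blended_stats(query_terms)
    print(query_terms, "df =", df, "tf =", tf, "idf =", round(bm25_idf(df), 3))
```

Both queries end up with df = 5, so they share one idf, and only the summed term frequency differs. Scored on its own, the bigram would have used df = 2 and a noticeably higher idf.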

The bigger issue is that Lucene doesn't have a way to differentiate shingles from synonyms... they both use tokens that overlap the position of other tokens in the token stream. So if unigrams are mixed with bi-(or larger)-grams, Lucene is tricked into thinking it's actually a synonym situation.
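
The overlap is easier to see with a small emulation of what a shingle filter emits. This is a simplified Python sketch, not the real Lucene ShingleFilter; the function name `shingle_positions` is made up for illustration, and it only tracks which terms land on which position.

```python
from collections import defaultdict


def shingle_positions(tokens, min_size=2, max_size=4, output_unigrams=True):
    """Group emitted terms by the position of their first token."""
    by_position = defaultdict(list)
    for start in range(len(tokens)):
        if output_unigrams:
            by_position[start].append(tokens[start])
        for size in range(min_size, max_size + 1):
            # Only emit a shingle if enough tokens remain.
            if start + size <= len(tokens):
                by_position[start].append(" ".join(tokens[start:start + size]))
    return dict(by_position)


# Mixed unigrams and shingles: two terms share position 0.
for pos, terms in sorted(shingle_positions(["depreciation", "tax"]).items()):
    print(pos, terms)

# Shingles only: nothing overlaps for this two-word query.
print(shingle_positions(["depreciation", "tax"], output_unigrams=False))
```

With unigrams enabled, "depreciation" and "depreciation tax" both sit at position 0, which lines up with the Synonym(content:depreciation content:depreciation tax) clause in the explain output from the question.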

The fix is to keep your unigrams and bi-plus-grams in different fields. That way Lucene won't attempt to use SynonymQueries in these situations, because the positions won't be overlapping anymore.
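
One possible way to lay this out, sketched as Python dicts that can be dumped to JSON and sent to Elasticsearch. The names `filter_shingle_only`, `shingle_only`, the `shingles` subfield and the `doc` mapping type are all hypothetical, and the layout is an assumption about how to apply the advice above, not a configuration confirmed anywhere in this thread.

```python
import json


def build_index_body(min_size=2, max_size=4):
    """Index body: unigrams in `content`, shingles in `content.shingles`."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "filter_shingle_only": {
                        "type": "shingle",
                        "min_shingle_size": min_size,
                        "max_shingle_size": max_size,
                        # Keep unigrams out of the shingle subfield.
                        "output_unigrams": False,
                    }
                },
                "analyzer": {
                    "shingle_only": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "filter_shingle_only"],
                    }
                },
            }
        },
        "mappings": {
            "doc": {
                "properties": {
                    "content": {
                        "type": "text",
                        "analyzer": "standard",
                        "fields": {
                            "shingles": {
                                "type": "text",
                                "analyzer": "shingle_only",
                            }
                        },
                    }
                }
            }
        },
    }


def build_query(text):
    """Match unigrams and shingles as separate clauses, each with its own df."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"content": text}},
                    {"match": {"content.shingles": text}},
                ]
            }
        }
    }


print(json.dumps(build_index_body(), indent=2))
print(json.dumps(build_query("depreciation tax"), indent=2))
```

Note that within the shingle subfield, shingles of different sizes starting at the same token still share a position, so for longer queries a separate subfield per shingle size may be worth testing.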

Brett_Anderson · April 6, 2018, 9:32pm

Thanks for the info. I started reconfiguring the index to use different fields and I can see the expected results now. It's good to know that this is expected behavior and I'm not implementing things the wrong way.

I find it a little odd that most of the Elasticsearch documentation around Shingles doesn't seem to address this and many examples use a single shingle filter to cover unigrams, bigrams and up. When shingles are treated as synonyms I feel much of the expected relevance boosting is missing.

I asked this question over on Stack Overflow too, so feel free to copy your response there as well. https://stackoverflow.com/questions/49546149/does-elasticsearch-score-different-length-shingles-with-the-same-idf

Thanks for your help.

polyfractal · April 9, 2018, 3:37pm

Np, happy to help!

Re: documentation... I haven't looked closely yet, but I suspect that's just a "bug" and the docs are out of date. The blended query behavior is reasonably new in Lucene, so it's possible they just weren't updated. I'll take a look around and see what we can fix. I agree it's a confusing situation as it currently stands and as it's documented.

Thanks!
