Yep, this is mostly expected.
It's not really the shingles causing the scoring oddness; what you're seeing is the frequency blending that SynonymQuery does. It uses the frequency of the original token for all the subsequent 'synonym' tokens, to keep them from skewing the scores. Synonyms are often relatively rare, and would drastically affect the scoring if they each used their individual df's (document frequencies).
From the Lucene docs:
"For scoring purposes, this query tries to score the terms as if you had indexed them as one term: it will match any of the terms but only invoke the similarity a single time, scoring the sum of all term frequencies for the document."
The SynonymQuery also sets the docFrequency to the maximum docFrequency across its terms. So for example, if:
- "deprecation" df == 5
- "deprecation tax" df == 2
- "deprecation taffy" df == 1
it will use 5 as the docFrequency for scoring purposes.
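To make the effect concrete, here is a rough sketch of the blending using the df values above. It assumes BM25-style idf and a made-up index size of 1000 docs (`doc_count` is hypothetical, as are the function names); it is a toy model, not Lucene's actual code path.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """BM25-style idf: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def blended_doc_freq(doc_freqs):
    """Toy model of the blending: every term shares the largest df."""
    return max(doc_freqs.values())

doc_count = 1000  # hypothetical index size, not from the original question
doc_freqs = {
    "deprecation": 5,
    "deprecation tax": 2,
    "deprecation taffy": 1,
}

shared_df = blended_doc_freq(doc_freqs)
shared_idf = bm25_idf(shared_df, doc_count)

for term, df in doc_freqs.items():
    own_idf = bm25_idf(df, doc_count)
    print(f"{term!r}: own df={df}, own idf={own_idf:.3f}, "
          f"blended df={shared_df}, blended idf={shared_idf:.3f}")
```

The rarer shingles would get a noticeably higher idf on their own; once blended, they all score with the idf of the most common term.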
The bigger issue is that Lucene has no way to differentiate shingles from synonyms: both produce tokens that overlap the position of other tokens in the token stream. So if unigrams are mixed with bigrams (or larger n-grams), Lucene is tricked into thinking it's a synonym situation.
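A simplified picture of the overlap, as a sketch: tokens are modeled as (term, position) pairs, and any position holding more than one token is what ends up looking like a synonym group. The token lists and helper names here are illustrative assumptions, not real analyzer output.

```python
from collections import defaultdict

def group_by_position(tokens):
    """Collect the terms that sit at each position in the token stream."""
    groups = defaultdict(list)
    for term, position in tokens:
        groups[position].append(term)
    return dict(groups)

def synonym_groups(tokens):
    """Positions with more than one term look like synonyms to the query builder."""
    return {
        position: terms
        for position, terms in group_by_position(tokens).items()
        if len(terms) > 1
    }

# Hypothetical token streams for the text "deprecation tax".
# Mixed field: the shingle is stacked on the same position as the first unigram.
mixed_field = [("deprecation", 0), ("deprecation tax", 0), ("tax", 1)]
# Separate fields: each stream has at most one token per position.
unigram_field = [("deprecation", 0), ("tax", 1)]
shingle_field = [("deprecation tax", 0)]

print("mixed:", synonym_groups(mixed_field))
print("unigrams only:", synonym_groups(unigram_field))
print("shingles only:", synonym_groups(shingle_field))
```

Only the mixed stream has a position with stacked tokens, which is the case where the synonym handling kicks in.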
The fix is to keep your unigrams and your bigrams (or larger n-grams) in different fields. That way Lucene won't attempt to use SynonymQueries in these situations, because the positions no longer overlap.
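Something like the following is one way to set that up, assuming you're on Elasticsearch and using a multi-field. The field name `title` and the filter/analyzer names (`shingles_only`, `shingle_analyzer`) are placeholders I made up; adjust the shingle sizes to whatever you use now.

```python
import json

# Sketch of index settings: the main field holds unigrams only,
# and a sub-field holds shingles only (output_unigrams disabled).
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "shingles_only": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": False,
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "shingles_only"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {  # placeholder field name
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "shingles": {"type": "text", "analyzer": "shingle_analyzer"}
                },
            }
        }
    },
}

# Query both fields; each one is analyzed separately, so no positions overlap.
search_body = {
    "query": {
        "multi_match": {
            "query": "deprecation tax",
            "fields": ["title", "title.shingles"],
            "type": "most_fields",
        }
    }
}

print(json.dumps(index_body, indent=2))
print(json.dumps(search_body, indent=2))
```

One thing to watch with `output_unigrams` off: a single-word query produces no shingles at all, so the shingle sub-field simply won't contribute for those queries.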