Shingles vs phrases for index size

I am looking at using either shingles or phrases to allow phrase queries. However, it seems to me that shingles must have a massive impact on index size. This is acknowledged in the documentation ("The downside is that you have larger indices..."), but it seems that the indexes won't just be large, but massively so.

Given that phrase queries are reasonably efficient, especially for rare word combinations that will often narrow down to a single document before even looking at a word-ordered phrase, are shingles really worth the massively bloated index that comes with them?

Hi Daniel,

Unfortunately, no one can really answer that for you. You'll need to test with your data so that you can determine whether the search performance of phrase queries (which, as you said, is reasonably efficient) is acceptable. If it is not, then shingles may be a tradeoff you want to check into.

One advantage of shingles over single terms is that Lucene will store the frequency of the word pair rather than just the frequencies of the individual terms.
Search ranking and discovery of phrases (using the significant_text aggregation) are improved when phrases are given their true worth through shingles. A phrase like “global warming” will be scored and presented (in significant text discovery) as a whole concept rather than as individual terms like “global”, which on their own are boring. The fact that significant_text discovers and presents word pairs is especially useful for finding names of people strongly connected with a topic.
I tend to stick with a max shingle size of 2 because it’s a sweet spot for the costs/benefits.
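
To get a rough feel for the cost side, here is a minimal Python sketch (not Lucene itself) that counts distinct terms and total postings for a toy corpus at different max shingle sizes. The `shingles` and `term_counts` helpers and the two sample documents are made up for illustration, and whitespace splitting stands in for a real analyzer.

```python
from collections import Counter


def shingles(tokens, size):
    """Return the word n-grams of exactly `size` tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def term_counts(docs, max_size):
    """Count every term emitted when indexing unigrams plus shingles up to max_size."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # crude stand-in for a real analyzer
        for size in range(1, max_size + 1):
            counts.update(shingles(tokens, size))
    return counts


# Hypothetical sample documents, only here to make the sketch runnable.
docs = [
    "global warming is a global problem",
    "warming oceans are a global problem",
]

for max_size in (1, 2, 3):
    counts = term_counts(docs, max_size)
    # Distinct terms approximate dictionary size; total occurrences approximate postings.
    print(max_size, len(counts), sum(counts.values()))
```

Running the same kind of count over a sample of your own data should give a better idea of how quickly the term dictionary grows between a max shingle size of 2 and 3 than any general rule of thumb.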

@Mark_Harwood thanks that makes sense about frequencies. To clarify the connection between phrases and shingles, if I use match_phrase will it automatically try and match against shingles before trying positional phrase matching? What about plain "match" (default bool OR) - does it care about shingles at all?

It depends on the output of your analysis chain. If you've configured things to produce shingles, then a query like "global warming" will get analyzed to a single term and so will produce a term query. For a shingle setting of 3, "catastrophic global warming" would also get analyzed to a single term, but with a setting of 2, it will be analyzed to two terms, "catastrophic global" and "global warming", and produce a positional query. Neither match_phrase nor match cares about shingles as such; they just operate on the output of the query analyzer.
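
As a rough illustration of the cases above, here is a small Python sketch that mimics what a shingle filter would emit for a query string. It is a simplification, not the Lucene implementation: it assumes min_shingle_size equals max_shingle_size and that output_unigrams is false, and the `fixed_size_shingles` and `query_kind` helpers are hypothetical names.

```python
def fixed_size_shingles(text, size):
    """Emit only shingles of exactly `size` words (no unigrams)."""
    tokens = text.lower().split()  # crude stand-in for the tokenizer
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def query_kind(terms):
    """Describe the kind of query the analyzed terms would lead to."""
    if not terms:
        return "no terms"
    return "term query" if len(terms) == 1 else "positional query"


for text, size in [
    ("global warming", 2),
    ("catastrophic global warming", 3),
    ("catastrophic global warming", 2),
]:
    terms = fixed_size_shingles(text, size)
    print(size, terms, query_kind(terms))
```

The point is only that the query type follows from how many terms the analyzer emits, not from anything shingle-specific in match or match_phrase.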

Note that there's been an 'index_phrases' option on text fields since 6.3, which handles shingling behind the scenes so that you don't need to worry about analysis configuration - see under
