Trying to understand the behaviour of query string with double quotes

I couldn't find a document that explains how the double quotes (exact string matches) in query strings work, so I did a bit of an experiment.

Both the analyzer and the search analyzer of my document field were created with a chain of:

standard tokenizer -> lowercase -> stopwords -> shingles (min:2, max:3)

I then indexed a document with:

" Typhoon Lekima, known in the Philippines as the Typhoon Hanna, was the second-costliest typhoon in Chinese history, only behind Fitow in 2013.[1] The ninth named storm of the 2019 Pacific typhoon season, Lekima originated from a tropical depression that formed east of the Philippines on July 30. It gradually organized, became a tropical storm and was named on August 4. Lekima intensified under favourable environmental conditions and peaked as a Category 4–equivalent super typhoon. However, an eyewall replacement cycle caused the typhoon to weaken before it made landfall in Zhejiang late on August 9, as a Category 2–equivalent typhoon. Lekima weakened subsequently while moving across the East China, and made its second landfall in Shandong on August 11."

Below are the query strings and results:

  1. " Lekima intensified under favourable environmental conditions and peaked"
    Hit.

  2. "Lekima intensified on favourable environmental conditions or peaked"
    Hit. (Change stopwords)

  3. " Lekima intensified under under favourable environmental conditions and peaked"
    No hit. (Add stopwords)

  4. "intensified Lekima under favourable environmental conditions and peaked"
    No hit. (Change sequence of token)

So it feels to me that the double quotes are simply skipping the tokenizer? It also seems to do a simple text match rather than rely on the tokens and frequencies: since both the document and the query are fed through the rest of the filters, changing the stopwords returns the same result. The stopwords filter replaces them with an underscore, so adding more stopwords gives no hit. The shingles filter will probably be skipped too in this case, because there are no tokens fed into the chain?

But I wonder how this was indexed, or whether it was indexed at all? When I tried it with a larger database (ca. 30,000 documents), it felt a bit too fast for the system to process everything first and then search through it with the query.

Thanks!

Double quotes make it a phrase query, meaning the words have to appear in sequence (positioned one after the other). The "slop" factor allows for some sloppiness in the required word positions.

Stopwords leave "holes" in the positions (check out the analyze API to see the effect of stopwords on positions).
So with your "under" example, we don't care which stopword was used at that position (in/on/under), but we do care that there was a single word at that position.
With your "under under" example you're effectively expecting two holes in the sequence, but in the original doc there is only one.
So when it comes to spacing, stopwords act a little like the blank tile in Scrabble, if you're familiar with that.
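
To make the "holes" idea concrete, here is a rough Python sketch of it. It is not how Elasticsearch or Lucene is implemented: the `analyze` and `phrase_matches` helpers and the tiny stopword list are made up for illustration (the list includes "under" only because the experiment above implies it is a stopword in that setup), and the shingle filter is ignored entirely.

```python
import re

# Toy stopword list (an assumption for this sketch, not the real filter's list).
STOPWORDS = {"a", "an", "and", "as", "in", "on", "or", "the", "under"}

def analyze(text):
    """Return (token, position) pairs. Stopwords are dropped,
    but each one still consumes a position, leaving a "hole"."""
    tokens = []
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        if word not in STOPWORDS:
            tokens.append((word, position))
    return tokens

def phrase_matches(doc_text, query_text):
    """True if the query tokens occur in the doc at the same relative positions."""
    query = analyze(query_text)
    if not query:
        return False
    # Collect the positions of every surviving token in the document.
    doc_positions = {}
    for word, position in analyze(doc_text):
        doc_positions.setdefault(word, set()).add(position)
    first_word, first_position = query[0]
    # Try to line the phrase up at each occurrence of its first token.
    for start in doc_positions.get(first_word, ()):
        if all(
            start + (position - first_position) in doc_positions.get(word, ())
            for word, position in query
        ):
            return True
    return False
```

With a document such as "Lekima intensified under favourable environmental conditions and peaked", swapping "under" for "on" keeps the hole at the same position, so the phrase still lines up, while "under under" pushes "favourable" one position further and the match fails.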

So with a phrase query, does Elasticsearch just go through the entire database in real time, or is there some sort of indexing in place, aside from the term frequencies?

The index has an entry for each word.
Each word has a list of matching docs.
Each of these also has a list of that word's positions in that doc.
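
As a rough picture of that structure, the following Python sketch builds a word -> doc -> positions map and answers a phrase lookup from it. The names `build_index` and `phrase_search` are hypothetical, the tokenizing is a plain whitespace split, and the real on-disk format is far more compact than nested dictionaries; the point is only that a phrase lookup touches the entries for the query words instead of scanning every document.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions of that word in the doc]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return the ids of docs in which the words appear consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    # Only docs that contain every word can possibly match.
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])
    hits = []
    for doc_id in sorted(candidates):
        # Check whether the words line up one position after the other.
        for start in index[words[0]][doc_id]:
            if all(
                start + offset in index[word][doc_id]
                for offset, word in enumerate(words)
            ):
                hits.append(doc_id)
                break
    return hits
```

Only documents that already appear under every query word are examined, and for those only the stored positions are compared, which is why the lookup stays quick even with tens of thousands of documents.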

It's designed to be fast :)

Right, a list of word positions too... of course, that makes sense. Thanks!