Behaviour of match_phrase query on English analyzed fields with stopwords


(Jaspreet Singh) #1

Hi,

I wanted to confirm a weird behaviour I'm observing when using a match_phrase query on English analyzed fields with stopwords.

  1. Assume my search string is analytics and prediction.
  2. Assume again, when searching against an English analyzed field, the tokens generated are analyt and predict.
  3. Now, when doing a match_phrase search against that field, I would expect ONLY the following text phrases to match:
    • analytics and prediction
    • analytics prediction
    • analyze and prediction
    • analyze prediction
    • analysis predict
    • analysis and predict
      etc.
      ... since and is a stopword, text with nothing between analytics and prediction should also match, in addition to text with an and between them. But nothing else.
  4. However, the behaviour I'm seeing is different (also backed by explain = true). Instead, the tokens match_phrase uses are analyt ? predict, where ? is a wildcard.
  5. So in essence it works like a match_phrase with a slop, matching ANY phrase that begins with a word that stems to analyt and ends with a word that stems to predict.

I'm wondering why! It makes it almost impossible to get a strict phrase match whenever there is a stopword in the query string.
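
For anyone who wants to reproduce this, here is a minimal sketch of the two request bodies involved, built as plain Python dicts and printed as JSON. The index name `my-index` and field name `description` are hypothetical placeholders; the sketch assumes the field is mapped with the built-in `english` analyzer. Nothing is sent to a cluster here, so the bodies can be pasted into whatever client is in use.

```python
import json

# Hypothetical names, not from the original post.
INDEX = "my-index"
FIELD = "description"

QUERY_TEXT = "analytics and prediction"

# Body for the _analyze API: shows the tokens and their positions.
analyze_body = {"analyzer": "english", "text": QUERY_TEXT}

# Body for the _search API: a match_phrase query with explain enabled.
search_body = {
    "explain": True,
    "query": {"match_phrase": {FIELD: QUERY_TEXT}},
}

print("GET /_analyze")
print(json.dumps(analyze_body, indent=2))
print(f"GET /{INDEX}/_search")
print(json.dumps(search_body, indent=2))
```

The `position` values in the `_analyze` response are the part worth looking at when a stopword sits inside the phrase.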


(Jaspreet Singh) #2

Any thoughts anyone?


(Jaspreet Singh) #3

@dadoonet would appreciate any pointers


(Jaspreet Singh) #4

I figured it out. Really the gist is ...
For a document to be considered a match for a phrase, say “quick brown fox”, the following must be true:

  • quick, brown, and fox must all appear in the field.
  • The position of brown must be 1 greater than the position of quick.
  • The position of fox must be 2 greater than the position of quick.

The specifics then come down to the tokens that are generated and their positions. A removed stopword such as and still takes up a position, so predict is expected 2 positions after analyt, whatever word sits in between.
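
The position rules above can be sketched with a toy simulation. This is not Elasticsearch or Lucene code: `STOPWORDS`, `TOY_STEMS`, `analyze`, and `phrase_matches` are hypothetical stand-ins, and the stem table only covers the words used here. The one assumption that matters is that a dropped stopword still consumes a position, which is how the gap in the phrase arises.

```python
# Toy stand-ins for the english analyzer (assumptions, not the real lists).
STOPWORDS = {"and", "or", "the"}
TOY_STEMS = {"analytics": "analyt", "prediction": "predict"}


def analyze(text):
    """Return (token, position) pairs. Stopwords are dropped,
    but the position counter still advances past them."""
    tokens = []
    for pos, word in enumerate(text.lower().split()):
        if word in STOPWORDS:
            continue  # leaves a position gap
        tokens.append((TOY_STEMS.get(word, word), pos))
    return tokens


def phrase_matches(query, doc):
    """Strict phrase match (no slop): every query token must appear in the
    document at the same offset from the first query token."""
    q = analyze(query)
    if not q:
        return False
    base = q[0][1]
    offsets = [(tok, pos - base) for tok, pos in q]

    doc_positions = {}
    for tok, pos in analyze(doc):
        doc_positions.setdefault(tok, set()).add(pos)

    first_token = offsets[0][0]
    for start in doc_positions.get(first_token, ()):
        if all(start + off in doc_positions.get(tok, ()) for tok, off in offsets):
            return True
    return False


query = "analytics and prediction"
for doc in ["analytics and prediction",
            "analytics drives prediction",
            "analytics prediction"]:
    print(doc, "->", phrase_matches(query, doc))
```

In this sketch, any single word between the two terms satisfies the phrase, while the two terms placed directly next to each other do not, because the offset is then 1 instead of 2.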