Match query unexpected behavior with token filter


#1

I have an analyzer whose token filter emits redundant tokens. For instance:

Original string: "TokyoJapan Samurai"

After tokenizing-->

  1. "TokyoJapan"
  2. "Samurai"

After token filter-->

  1. "tokyo", "japan", "tokyojapan"
  2. "samurai"

Therefore, we get a total of 4 tokens, i.e. ("tokyo", "japan", "tokyojapan", "samurai").

Now, if I issue a match query with the AND operator, using the same analyzer, for the string "OsakaJapan Samurai",
the tokens generated after the token filter (just like above) should be-->

  1. "osaka", "japan", "osakajapan"
  2. "samurai"

However, this matches a field containing "TokyoJapan Samurai" as well. The reason is that, even with the AND operator, the match query internally looks for:
MATCH any of the tokens generated in (1), i.e. ("osaka" OR "japan" OR "osakajapan")
AND any of the tokens generated in (2), i.e. ("samurai")

i.e. ("osaka" OR "japan" OR "osakajapan") AND ("samurai")
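
To make the behavior above concrete, here is a minimal Python sketch that simulates it outside Elasticsearch. The `analyze` and `match_and` functions are hypothetical stand-ins, not Elasticsearch APIs: `analyze` assumes the token filter splits on case changes and keeps the original word at the same position, and `match_and` assumes tokens sharing a position are treated as alternatives, as described above.

```python
import re

def analyze(text):
    """Return one set of tokens per position (assumed analyzer behavior)."""
    positions = []
    for word in text.split():
        # Split on case changes, e.g. "TokyoJapan" -> ["Tokyo", "Japan"].
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+", word)
        # The parts and the whole word all share the same position.
        positions.append({p.lower() for p in parts} | {word.lower()})
    return positions

def match_and(query_text, doc_text):
    """AND across positions, but OR among the tokens at each position."""
    doc_terms = set().union(*analyze(doc_text))
    return all(
        any(token in doc_terms for token in position)
        for position in analyze(query_text)
    )

# "japan" alone satisfies position 1, so the Tokyo document matches.
print(match_and("OsakaJapan Samurai", "TokyoJapan Samurai"))  # True
```

Under these assumptions, the shared "japan" token is enough to satisfy the first position, which is why the unwanted document comes back.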

Ideally, I would like it to be:
MATCH (("osaka" AND "japan") OR "osakajapan") AND "samurai"

This way, the following docs would match or not match as follows -->

  1. "OsakaJapan Samurai" - matches, obviously
  2. "osakAjapan Samurai" - matches only because "osakajapan" matches, and not because the split tokens "osak" and "ajapan" exist
  3. "Osaka Samurai" - matches because 'osaka' and 'samurai' match.
  4. "TokyoJapan samurai" - doesn't match because the doc contains neither ("osaka" AND "japan") nor "osakajapan".
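
The expression I am after could be sketched as a nested bool query built by hand. This is only an illustration under assumptions: the field name `title` and the helpers `split_word`, `index_terms`, `build_query` and `evaluate` are hypothetical, the case-change splitting stands in for my real token filter, and `evaluate` is a tiny local interpreter for the `bool`/`term` subset used here, not how Elasticsearch executes queries. It follows the expression literally, so a word matches only if all of its parts are present or the whole compound is.

```python
import re

def split_word(word):
    """Split on case changes, lowercased (assumed token filter behavior)."""
    return [p.lower() for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+", word)]

def index_terms(text):
    """All terms the assumed analyzer would index for a document."""
    terms = set()
    for word in text.split():
        terms.update(split_word(word))
        terms.add(word.lower())
    return terms

def build_query(field, text):
    """Per word: (all parts) OR (whole word); words are ANDed together."""
    must = []
    for word in text.split():
        all_parts = {"bool": {"must": [{"term": {field: p}} for p in split_word(word)]}}
        whole = {"term": {field: word.lower()}}
        must.append({"bool": {"should": [all_parts, whole],
                              "minimum_should_match": 1}})
    return {"bool": {"must": must}}

def evaluate(query, terms):
    """Evaluate the bool/term clauses built above against a set of terms."""
    if "term" in query:
        (value,) = query["term"].values()
        return value in terms
    clause = query["bool"]
    must_ok = all(evaluate(q, terms) for q in clause.get("must", []))
    should = clause.get("should", [])
    # Treats "should" as "at least one", matching minimum_should_match: 1 above.
    should_ok = not should or any(evaluate(q, terms) for q in should)
    return must_ok and should_ok

query = build_query("title", "OsakaJapan Samurai")
for doc in ["OsakaJapan Samurai", "osakAjapan Samurai", "TokyoJapan samurai"]:
    print(doc, evaluate(query, index_terms(doc)))
```

With these assumptions, cases 1 and 2 evaluate to a match and case 4 does not. Case 3 is left out of the loop because its outcome depends on whether a part such as "osaka" on its own should count; the literal expression requires both parts.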

Any idea how I can achieve this behavior?

I believe I'd have to use a different search analyzer from the indexing analyzer, but I may be wrong.

