I have an analyzer that produces redundant tokens. For instance:
Original string: "TokyoJapan Samurai"
After tokenizing:
- "TokyoJapan"
- "Samurai"
After the token filter:
- "tokyo", "japan", "tokyojapan"
- "samurai"
So we get a total of 4 tokens, i.e. ("tokyo", "japan", "tokyojapan", "samurai").
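To make the tokenization above concrete, here is a rough pure-Python imitation of it. It is only a sketch: the function name `analyze` and the splitting rule (split on whitespace, then at lower-to-upper case changes, lowercase everything, and also keep the joined word) are assumptions chosen to reproduce the example, not the real analyzer configuration.

```python
import re

def analyze(text):
    """Imitate the analyzer: return one group of tokens per whitespace-separated word."""
    groups = []
    for word in text.split():
        # Insert a space at each lower-to-upper case change, then lowercase and split.
        parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).lower().split()
        tokens = list(parts)
        if len(parts) > 1:
            # Also keep the whole word as a single token (the "redundant" token).
            tokens.append("".join(parts))
        groups.append(tokens)
    return groups

print(analyze("TokyoJapan Samurai"))
# [['tokyo', 'japan', 'tokyojapan'], ['samurai']]
```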
Now, if I issue an ANDed match query using the same analyzer for the string "OsakaJapan Samurai", the tokens generated after the token filter (just like above) should be:
- "osaka", "japan", "osakajapan"
- "samurai"
However, this matches a field containing "TokyoJapan Samurai" as well. The reason is that, even with AND, the match query internally looks for:
MATCH any of the tokens generated from the first word, i.e. ("osaka" OR "japan" OR "osakajapan")
AND any of the tokens generated from the second word, i.e. ("samurai")
i.e. ("osaka" OR "japan" OR "osakajapan") AND ("samurai")
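The behavior described above can be reproduced with a small in-memory model. This is a sketch, not Elasticsearch itself: `matches_current` is a made-up helper, and the token lists are hard-coded from the example.

```python
def matches_current(query_groups, doc_tokens):
    """AND across words, but OR across the tokens produced from a single word."""
    return all(any(tok in doc_tokens for tok in group) for group in query_groups)

# Tokens of the query "OsakaJapan Samurai", grouped by originating word.
query_groups = [["osaka", "japan", "osakajapan"], ["samurai"]]

# Tokens indexed for the document "TokyoJapan Samurai".
doc_tokens = {"tokyo", "japan", "tokyojapan", "samurai"}

print(matches_current(query_groups, doc_tokens))  # True: "japan" and "samurai" are enough
```

The shared token "japan" alone satisfies the first group, which is why the unwanted document is returned.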
Ideally, I would have liked it to be:
MATCH (("osaka" AND "japan") OR "osakajapan") AND "samurai"
This way, the following documents would match or not match:
- "OsakaJapan Samurai" - matches, obviously.
- "osakAjapan Samurai" - matches only because "osakajapan" matches, and not because "osak" and "ajapan" exist.
- "Osaka Samurai" - matches because 'osaka' and 'samurai' match.
- "TokyoJapan samurai" - does not match, because neither ("tokyo" AND "japan") nor "tokyojapan" matches.
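Written out by hand, the desired expression (("osaka" AND "japan") OR "osakajapan") AND "samurai" corresponds roughly to a nested bool query. The sketch below only builds the request body as a Python dict and prints it as JSON. The field name `title` and the helper `term` are hypothetical, and the body has not been run against a cluster. It also does not show how to make a match query produce this shape automatically from the analyzer output.

```python
import json

FIELD = "title"  # hypothetical field name, not taken from the original post

def term(token):
    """Build a term query on FIELD. Term queries are not analyzed,
    so the token must already be in its indexed (lowercase) form."""
    return {"term": {FIELD: token}}

desired_query = {
    "query": {
        "bool": {
            "must": [
                {
                    # ("osaka" AND "japan") OR "osakajapan"
                    "bool": {
                        "should": [
                            {"bool": {"must": [term("osaka"), term("japan")]}},
                            term("osakajapan"),
                        ],
                        "minimum_should_match": 1,
                    }
                },
                # AND "samurai"
                term("samurai"),
            ]
        }
    }
}

print(json.dumps(desired_query, indent=2))
```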
Any idea how I can achieve this behavior?
I believe I'd have to use a different search analyzer from the indexing analyzer, but I may be wrong.