Word ngrams with match phrase

I've indexed my documents with a common_grams token filter, and it's working well. If I search for kiss the frog with a match phrase query, it finds documents that contain text like you must kiss the frog he said. So far, so good.

However, if I search for kiss the frog she (note the she instead of he), a match phrase query finds nothing. That makes sense, because that exact string of text doesn't appear in any document. I can switch to a regular match query, and then it finds documents like the frog was green or kiss the girl said the witch. In my context this is often not desirable, because I get something like 12,000 matches or some other ridiculously high number: basically every document that contains frog, or kiss the, or the frog, and so on. Not very useful. (Also, this was a very simple and short example.)

What I'd really like to do is "re-attempt the query". A human might notice and say "Try searching for just kiss the frog and skip the last word." I've tried writing some Python code with some basic natural language processing that breaks the whole search term up into smaller parts and re-attempts each one, one at a time. This functionality kicks in only when a match phrase query finds 0 documents. But this is kind of lousy, because a long search term has a huge number of word n-gram combinations.
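To make the "break it into smaller parts" step concrete, here is a minimal sketch of one way to generate the candidate phrases. It is not the code I actually run. The function name word_ngrams and the min_len parameter are made up for illustration. It yields contiguous word n-grams longest first and skips duplicates, so the re-attempts can stop at the first phrase that gets a hit.

```python
def word_ngrams(query, min_len=3):
    """Return contiguous word n-grams of `query`, longest first, without duplicates.

    `min_len` (a hypothetical parameter) is the shortest phrase, in words,
    that is still worth retrying as a match phrase query.
    """
    words = query.split()
    seen = set()
    result = []
    # Start one word below the full length: the full phrase already found 0 documents.
    for size in range(len(words) - 1, min_len - 1, -1):
        for start in range(len(words) - size + 1):
            phrase = " ".join(words[start:start + size])
            if phrase not in seen:
                seen.add(phrase)
                result.append(phrase)
    return result


candidates = word_ngrams("kiss the frog she", min_len=3)
```

For a query of n words this still produces on the order of n squared phrases in the worst case, which is exactly the problem described here. Ordering them longest first only helps if the loop stops early.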

For example, from the query term:

cause im not who you think i am you think i wont hurt no body you think i can be your medicine

I get:

['cause im not who you',
 'cause im not who',
 'cause im not',
 'cause im not who you think',
 'cause im not who you',
 'cause im not who',
 'think i wont hurt',
 'wont hurt no body',
 'you think i wont',
 'i wont hurt no',
 'hurt no body you',
 'no body you think',
 'body you think i']

That's 13 additional queries! And sometimes I have to go further and look for even shorter word n-grams, and there are even more of those.
Also, sometimes I go further still and run boolean "must" queries on combinations like:

[['cause im not', 'no body you think'],
['cause im not', 'you think i wont'],
...
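For reference, each of those combinations becomes a request body roughly like the one built below. This is a sketch only: the helper name must_phrases_query is invented, and the field name "text" is an assumption standing in for whatever field the documents are actually indexed under. The bool, must and match_phrase keys are the standard Elasticsearch query DSL.

```python
def must_phrases_query(phrases, field="text"):
    """Build a bool query in which every phrase must match as a phrase.

    `field` is a placeholder; replace it with the real field name.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {field: phrase}} for phrase in phrases]
            }
        }
    }


body = must_phrases_query(["cause im not", "no body you think"])
```

The resulting dict can be sent as the body of a normal search request. Every extra pair of phrases is one more round trip, so the number of requests grows quickly.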

I would love some guidance and tips on how to improve on this.

My goal is to have what Google has: in a split second, they realize that the whole search term doesn't match anything, yet the search results are still good, and they just add a note saying "Missing: /she/".
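The "Missing:" note itself is the easy part once a shorter phrase has matched. A minimal sketch, assuming simple whitespace tokenization and a made-up helper name missing_words, that reports which words of the original query are not covered by the phrase that finally matched:

```python
def missing_words(query, matched_phrase):
    """Return the words of `query`, in order, that `matched_phrase` does not cover.

    Repeated words are counted, so a word that occurs twice in the query but
    once in the matched phrase is reported once.
    """
    remaining = {}
    for word in matched_phrase.split():
        remaining[word] = remaining.get(word, 0) + 1

    missing = []
    for word in query.split():
        if remaining.get(word, 0) > 0:
            remaining[word] -= 1  # this occurrence is accounted for
        else:
            missing.append(word)
    return missing


note = "Missing: " + " ".join(missing_words("kiss the frog she", "kiss the frog"))
```

What this does not solve is the hard part, which is finding the matching shorter phrase without sending a large number of queries.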
