Disable stop words and stemming for phrase searches when using query_string


(Imran Azad) #1

How can I disable stop words and stemming for phrase searches when using query_string? For example, I want the user to search for "the mouth" and get back documents that contain the exact phrase.


(Doug Turnbull) #2

Do you stem or strip stop words at index time? If so, you can't re-enable them at query time. The original tokens are simply gone until you rebuild your index with stop words and stemming disabled. Otherwise, you can specify a different query analyzer that takes out stop words.

Do you want to prioritize exact matches over stemmed/stopworded matches? There are a couple of ways you could do that.

  1. Create an extra field that has exact text
  2. Run two query_string queries, each wrapped in a SHOULD clause in an outer boolean query
  3. Exact matches will match both SHOULD clauses and be ranked higher
  4. Inexact/stemmed matches will match just one
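
The steps above can be sketched as a request body. This is only an illustration built as a plain Python dict: the field names `content` (stemmed) and `content.exact` (unstemmed, stop words kept) and the helper name are assumptions, not something from this thread's actual mapping.

```python
import json


def build_exact_boost_query(user_query, stemmed_field="content",
                            exact_field="content.exact"):
    """Build a bool query with two SHOULD clauses.

    Both clauses run the same query_string input, one against the
    stemmed field and one against the exact field (field names are
    hypothetical). A document matching both clauses scores higher
    than a document matching only the stemmed clause.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "query_string": {
                            "query": user_query,
                            "default_field": stemmed_field,
                        }
                    },
                    {
                        "query_string": {
                            "query": user_query,
                            "default_field": exact_field,
                        }
                    },
                ]
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent to the _search endpoint.
    print(json.dumps(build_exact_boost_query('"the mouth"'), indent=2))
```

Note that this ranks exact matches first but still returns stemmed-only matches, so it is a scoring tweak rather than a filter.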

You might also be interested in combo analyzers, which insert both stemmed and non-stemmed tokens into a single field.


(Imran Azad) #3

Hi Doug,

Thanks for the response. We stem at index time; however, I also have a separate field which indexes the content without stemming.

The key requirement that I forgot to mention is that we need stemming enabled for non-phrase searches but disabled for phrase searches. To complicate matters further, we also need to cater for "mixed" queries such as "the mouth" OR the cancerous cells AND drugs

So we need Boolean operators, with stemming enabled for keyword searches but no stemming for phrases. That is why we are using query_string: as much fun as it would be, we didn't want to write our own parser!
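
For reference, one option that may fit this mixed case is the `quote_field_suffix` parameter of query_string, which sends quoted phrases to a differently analyzed sub-field while unquoted terms keep using the main field. The sketch below assumes an Elasticsearch version that supports this parameter and a hypothetical mapping in which `content` is stemmed and `content.exact` is a sub-field analyzed without stemming or stop word removal.

```python
import json


def build_mixed_query(user_query, field="content", exact_suffix=".exact"):
    """Build a single query_string query for mixed input.

    Unquoted terms are searched in `field` (stemmed); quoted phrases
    are searched in `field + exact_suffix`. Both names are assumptions
    about the mapping, not taken from the thread.
    """
    return {
        "query": {
            "query_string": {
                "query": user_query,
                "fields": [field],
                # Suffix appended to the field name for quoted text only.
                "quote_field_suffix": exact_suffix,
            }
        }
    }


if __name__ == "__main__":
    body = build_mixed_query('"the mouth" OR the cancerous cells AND drugs')
    print(json.dumps(body, indent=2))
```

Boolean parsing stays with query_string, so no custom parser is needed.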

The main problem I have discovered is with the highlighting. For example, take the following query, which searches two fields, one with stems and one without:

"the mouth"

It returns a document with two highlighted fields:

"the mouth"
"mouth"

What I really need is the ability to merge the two so that there is only one highlight.
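
For reference, the fast vector highlighter has a `matched_fields` option intended for this kind of merge: matches from several differently analyzed fields are combined into the highlight of one field. The sketch below is only an assumption about how it could apply here. It uses the hypothetical field names `content` and `content.exact`, and it requires both fields to be mapped with `term_vector` set to `with_positions_offsets`.

```python
import json


def build_merged_highlight(field="content", exact_field="content.exact"):
    """Build a highlight section that merges matches from two fields.

    Matches found in `field` and `exact_field` (hypothetical names) are
    combined and reported as a single highlight on `field`.
    """
    return {
        "highlight": {
            "order": "score",
            "fields": {
                field: {
                    # matched_fields is used here with the fast vector highlighter.
                    "type": "fvh",
                    "matched_fields": [field, exact_field],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_merged_highlight(), indent=2))
```

This section would be sent alongside the query in the same search request body.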

