Analyzer issues using query_string in ES 1.5.2?


(James) #1

I'm a noob using ES 1.5.2. I want to ngram-analyze a field on index, but do no analysis on search. Why? I want a user to be able to search for "group" and match the field "aged grouper" (no wildcards required, but still supported). However, if the user enters "aged grouper" I only want to match documents where my search field contains (at least) that entire phrase.

I created an ngram analyzer that I map to the field for index, and a "dummy analyzer" (to keep the whole phrase together) that I map to the field for search. I can test both analyzers using the analyze api, and see that they are getting tokenized correctly.

Everything seems correct. However, when I do my query_string search, the search text still gets tokenized into words. So, searching for "group" DOES find "aged grouper" but searching for "the group" finds all documents that have EITHER "the" OR "group" in them. I want the whole phrase to be used in the search.

I'm confused because the analyze api and the validate api seem to give me two different answers (I think):

If I use the analyze api: _analyze?analyzer=dummy_analyzer&text=Hello there
..
<token>Hello there</token> <== looks correct
..

However, If I use the validate api:
_validate/query?pretty=true&explain=true&analyzer=dummy_analyzer

{
  "query": {
    "query_string": {
      "query": "Hello there",
      "default_field": "tfield",
      "analyzer": "dummy_analyzer"
    }
  }
}

results in:
<explanation>props.tfield:Hello props.tfield:there</explanation> <== looks INCORRECT (breaking phrase apart)
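
(My current working theory, which I haven't verified against the 1.5 source: the Lucene query parser behind query_string splits the input on whitespace and operators before the analyzer ever runs, so each word gets analyzed on its own. Wrapping the text in quotes should make the parser keep it together as a single phrase:)

```json
{
  "query": {
    "query_string": {
      "query": "\"Hello there\"",
      "default_field": "tfield",
      "analyzer": "dummy_analyzer"
    }
  }
}
```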

My config is below. My questions:

  • Can someone explain the differences between the api results?
  • Why isn't the search using the dummy_analyzer (would you expect this approach to work)?
  • Is there a better way to have a field not analyzed on search only, rather than using my kludged dummy_analyzer?

Thanks very much for any insight! -J

"analysis":{
  "analyzer":{
      "ngram_analyzer":{
          "type":"custom",
          "tokenizer":"ngram_tokenizer"
      },
      "dummy_analyzer":{
        "type":"pattern",
        "pattern":"00xyzzy00"  <-- a dummy string trying to never separate words
      }
   },
   "tokenizer":{
       "ngram_tokenizer": {
           "type":"nGram",
           "min_gram":"4",
           "max_gram":"500"
        }
    }
}

"mapping":{
 ....
   "tfield":{
       "index_analyzer":"ngram_analyzer",
       "search_analyzer":"dummy_analyzer",
       "type": "string",
       "index": "analyzed"
    }
....
Is there a NOOP analyzer?
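
(Partially answering myself: Elasticsearch ships a built-in `keyword` analyzer that emits the entire input as a single token, which might serve as the no-op search analyzer instead of my pattern hack; a sketch, with the rest of the mapping unchanged:)

```json
"tfield": {
    "type": "string",
    "index_analyzer": "ngram_analyzer",
    "search_analyzer": "keyword"
}
```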
(Luca Cavanna) #2

I think these differences may just have to do with using the query_string. May I ask if you tried the match query instead? Or are there features that you need out of the query_string query?
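
For example, a phrase-type match query (a sketch reusing the `tfield` name from your mapping) runs the whole input through the field's search analyzer and then requires the resulting terms as a phrase, instead of OR-ing the words together:

```json
{
  "query": {
    "match": {
      "tfield": {
        "query": "aged grouper",
        "type": "phrase"
      }
    }
  }
}
```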


(James) #3

The syntax of the query_string is ideal for my users. I suppose I could give up on providing wildcards. It was suggested that I abandon query_string and use match (working on that now).

My goal:

  • query "fort" should match "unfortunately" (as if "*fort*" was entered)
  • query "is unfortunate" should only match fields with (at least) that whole phrase
  • query "my fort??e" should match "my fortune" (in the best of all possible worlds)
  • queries should allow for simple AND, OR, NOT, and () grouping logic
  • there can be no fuzziness (only allow exact phrase matches with the constraints above)

I get the leading/trailing wildcard simulation by indexing with an ngram analyzer (I can test with the analyze api and verify it is working correctly).
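
To convince myself the ngram trick works, here's a rough Python emulation of what an nGram tokenizer with min_gram 4 does to a single word (an illustration only, not Elasticsearch's actual code):

```python
def ngrams(text, min_gram=4, max_gram=500):
    """Rough emulation of an nGram tokenizer on a single word:
    emit every substring whose length is in [min_gram, max_gram]."""
    out = []
    for start in range(len(text)):
        for length in range(min_gram, max_gram + 1):
            if start + length > len(text):
                break
            out.append(text[start:start + length])
    return out

tokens = ngrams("unfortunately")
print("fort" in tokens)  # True: the 4-gram "fort" is indexed, so query "fort" matches
```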

Does this seem feasible with a match (minus the mid-term wildcard support)?
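
My plan for the boolean logic (untested) is to wrap phrase-type match clauses in a bool query, again assuming the `tfield` mapping above:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "tfield": { "query": "is unfortunate", "type": "phrase" } } }
      ],
      "must_not": [
        { "match": { "tfield": { "query": "aged grouper", "type": "phrase" } } }
      ]
    }
  }
}
```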

