Phrase matching using query_string on nGram analyzed data


(Mike) #1

I have my string field indexed and analyzed with nGrams, and I can't seem to get
phrase matching using " " in my search text to work. Other things like
fuzzy matching with ~, combining words with && and ||, and boosting with ^ work
fine. Am I doing something wrong, or does phrase matching not work
with ngrams?

My mapping:
"properties" : {
    "myquery" : {
        "type" : "multi_field",
        "fields" : {
            "myquery"          : { "type" : "string",
                                   "index_analyzer" : "myAnalyzer",
                                   "search_analyzer" : "myAnalyzer2" },
            "myqueryUntouched" : { "type" : "string",
                                   "index" : "not_analyzed" }
        }
    },
    ...

My settings:
"analysis" : {
    "analyzer" : {
        "myAnalyzer" : {
            "tokenizer" : "standard",
            "filter" : ["standard", "lowercase", "stop", "myNGram"]
        },
        "myAnalyzer2" : {
            "tokenizer" : "standard",
            "filter" : ["standard", "lowercase", "stop"]
        }
    },
    "filter" : {
        "myNGram" : {
            "type" : "nGram",
            "min_gram" : 1,
            "max_gram" : 8
        }
    }

My query:
"query" : {
    "query_string" : {
        "default_field" : "myquery",
        "default_operator" : "AND",
        "query" : "\"ibm eps\""
    }
}

If I remove the escaped quotes, I get many results, as I expect, like:
ibm eps
ibm q2 eps
ibm 2001 eps

When the quotes are included, though, I want only the ibm eps results.

--


(Clinton Gormley) #2

Hi Mike

On Fri, 2012-09-14 at 15:11 -0700, Mike wrote:

I have my string field indexed and analyzed with nGrams, and I can't seem to
get phrase matching using " " in my search text to work. Other things
like fuzzy matching with ~, combining words with && and ||, boosting
with ^ work fine though. Am I doing something wrong, or does phrase
matching not work with ngrams?

Phrase matching does work with ngrams, but there is a long-standing bug
in the edge-ngram analyzer in Lucene which outputs different token
positions from the standard tokenizer.

So if you analyze the field with edge-ngrams and you do a phrase search
on the field using the SAME analyzer, then it will work. But you are
using the standard tokenizer at search time, not the edge-ngram
tokenizer.
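Clint's point can be sketched in plain Python (an illustration of the thread's setup, not Lucene's actual code): the index-time analyzer expands each word into many gram terms, so the whole-word terms the search analyzer produces all exist in the index and a plain AND query matches; a phrase query, however, additionally compares term positions, which is where the mismatch bites.

```python
# Rough sketch of what the two analyzers in this thread emit.
# Assumption: myAnalyzer2 (standard tokenizer) yields one lowercased
# whole-word token per position, while myNGram expands each word into
# all of its 1..8-character substrings.

def standard_tokens(text):
    """Approximate the standard tokenizer: one token per word, with positions."""
    return [(pos, word.lower()) for pos, word in enumerate(text.split())]

def ngram_terms(text, min_gram=1, max_gram=8):
    """Approximate the nGram token filter: every substring of each token."""
    terms = set()
    for _, word in standard_tokens(text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            for i in range(len(word) - n + 1):
                terms.add(word[i:i + n])
    return terms

indexed = ngram_terms("ibm eps")
search = standard_tokens("ibm eps")

# Both whole words exist as indexed terms, so a plain AND query matches...
assert all(tok in indexed for _, tok in search)
# ...but a phrase query also checks term *positions*, and (per the Lucene
# bug Clint mentions) the positions the ngram filter assigned at index
# time did not line up with the standard tokenizer's search-time positions.
```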

clint


--


(Mike) #3

Thanks for the response, Clint! I assume what you said applies to both the
edge-nGram and regular nGram filters, since I am only using the regular
nGram filter in my index analyzer.

You mentioned that I should use the ngram tokenizer, not the standard
tokenizer. Does this mean that I should not use the ngram filter? I was
hoping to get partial search matches, which is why I used the ngram filter
only at index time and not at query time as well (national should
find a match with international).
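The partial-match hope checks out at the term level, as a small sketch shows (assuming the min_gram 1 / max_gram 8 settings from the mapping above): "international" is indexed as every substring of length 1 to 8, and since "national" is exactly 8 characters, it is one of those indexed grams, so the whole-word search term matches it directly.

```python
# Sketch of why index-time-only ngrams give partial matches, using the
# 1..8-gram settings from this thread's mapping.

def ngrams(word, min_gram=1, max_gram=8):
    """All substrings of length min_gram..max_gram, like the nGram filter."""
    return {word[i:i + n]
            for n in range(min_gram, min(max_gram, len(word)) + 1)
            for i in range(len(word) - n + 1)}

assert "national" in ngrams("international")  # 8-char substring -> indexed
assert "nation" in ngrams("international")
assert "internation" not in ngrams("international")  # longer than max_gram 8
```

This also shows the limit of the approach: query words longer than max_gram (here, 8) will never match a gram, so max_gram caps the length of partial matches.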

--


(Clinton Gormley) #4

On Mon, 2012-09-17 at 07:40 -0700, Mike wrote:

    Thanks for the response Clint!  I assume what you said applies
    to both the edge-nGram and regular nGram filters, since I am
    only using the regular nGrams filter in my index analyzer.  

Yes, it affects the ngrams as well:

https://issues.apache.org/jira/browse/LUCENE-1224

    You mentioned that I should use the ngram tokenizer not the
    standard tokenizer, does this mean that I should not use the
    ngram filter?  I was hoping to get partial search matches,
    which is why I used the ngram filter only during index time
    and not during query time as well (national should find a
    match with international).

No, you can use the ngram tokenizer or token filter. The important
thing is to use the same analyzer at index and search time. This is
almost a golden rule, unless you really understand what you're doing.
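Applied to the mapping in this thread, Clint's rule would mean pointing both analyzer settings at the same ngram analyzer. A hypothetical fixed field definition (shown as a Python dict for readability; the JSON sent to Elasticsearch has the same shape):

```python
# Hypothetical fix for the "myquery" sub-field from this thread, following
# the rule above: the SAME analyzer at index and search time.

fixed_field = {
    "myquery": {
        "type": "string",
        "index_analyzer": "myAnalyzer",   # ngram analyzer at index time...
        "search_analyzer": "myAnalyzer",  # ...and the same one at search time
    }
}

assert (fixed_field["myquery"]["index_analyzer"]
        == fixed_field["myquery"]["search_analyzer"])
```

Note the trade-off: with the ngram analyzer on both sides, phrase positions line up, but the query text is also ngrammed, which changes how partial matching behaves compared to the whole-word search analyzer used originally.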

clint

--
