Query on multiple fields


(hemant pahilwani) #1

I am new to Elasticsearch and am currently migrating my existing Lucene
queries. Here is the setup for the index properties:

    index :
        analysis :
            analyzer :
                myStandard :
                    tokenizer : nGram
                    filter : [standard, lowercase, stop]

    "user" : {
        "_source" : { "enabled" : true },
        "properties" : {
            "userId" : { "type" : "string", "index" : "analyzed", "analyzer" : "myStandard" },
            "firstName" : { "type" : "string", "index" : "analyzed", "analyzer" : "myStandard" },
            "lastName" : { "type" : "string", "index" : "analyzed", "analyzer" : "myStandard" },
            "email" : { "type" : "string", "index" : "not_analyzed" }
        }
    }

The situation is that a user can enter any text in the search box to search
for a user, and the search needs to run against firstName and lastName. It
would be an OR query with both prefix and suffix wildcards, like
*firstName* OR *lastName*. Exact matches of firstName and lastName have to
be displayed at the top. What would be the best way to achieve this? A bool
query? Also, is nGram the correct tokenizer for this situation?

--


(hemant pahilwani) #2

Okay, I figured out a query for my situation, but the result contains a
few records that I wasn't expecting.

data in index:

"1", "John", "Doe", "johndoe@somecompany.com"

"2", "Johny", "Doey", "johndoe34@somecompany.com"

"3", "Johnathan", "Chang", "johndoe56@ somecompany.com"

"4", "longjohn", "silver", "johndoe76@somecompany com"

"5", "dijohn", "Doera", "johndoe12@ somecompany.com"

"6", "papajohn", "Do", "johndoe09@somecompany.com"

"7", "ojohnyyy", "Dey", "johndoe87@somecompany.com"

"8", "jerry", "seinfield", "someemails@somecompany.com"

"9", "kramer", "dicosta", "someemailagain@somecompany.com"

The following query returns all the records:

curl -XGET 'http://localhost:9200/index/_search?pretty=true' --data-binary '{"query":{"query_string":{"fields":["firstName","lastName"],"query":"john doe","use_dis_max":true}}}'

Records with ids 1-7 are correct and expected, but 8 and 9 are also
returned, even though they don't really match any of the values passed in
the query.

Any ideas on how to exclude the last 2 records from the search result?

--


(David Pilato) #3

Looks like you are using the default nGram tokenizer.
By default, ngrams have min size 1 and max size 2.

So, I suppose that seinfield is tokenized as "s, se, e, ei, ... d".
When you search on john doe, it's also tokenized with the same analyzer. And you
have a "d" in "doe". So "d" matches "d" and you get your document back.
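
To make the tokenization concrete, here is a minimal Python sketch of character n-grams with min size 1 and max size 2. It only approximates the nGram tokenizer plus the lowercase filter: the helper name `ngrams` is made up for illustration, and the standard and stop filters from the analyzer are not modelled.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Return the set of lowercase character n-grams of `text`.

    Hypothetical helper: an approximation of an nGram tokenizer followed
    by a lowercase filter, not Elasticsearch's actual implementation.
    """
    text = text.lower()
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.add(text[start:start + size])
    return grams


# Grams shared by the query word "doe" and the indexed value "seinfield".
shared = ngrams("doe") & ngrams("seinfield")
print(sorted(shared))  # ['d', 'e']
```

Under these assumptions a single shared gram such as "d" is enough for an OR-style match, which is consistent with record 8 coming back.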

Correct me if I'm wrong.

David.

--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--


(Clinton Gormley) #4

Hiya

You're using ngrams on your name fields, so "john" becomes a search for
its n-grams: j, o, h, n, jo, oh, hn with the default gram sizes (plus joh,
ohn, john if the max gram size is raised to 4).

And the default operator for query string is OR, which means that any
document containing any one of those grams is returned.
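
The effect of OR over n-grams can be sketched in a few lines of Python against the sample records from this thread. This is a simplified model, not what Elasticsearch runs internally: `ngrams` and `matches_any` are hypothetical helpers, gram sizes 1 to 2 are assumed, and scoring and the stop filter are ignored.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Lowercase character n-grams of `text` (hypothetical helper)."""
    text = text.lower()
    return {
        text[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(text) - size + 1)
    }


# (id, firstName, lastName) taken from the sample data in this thread.
USERS = [
    (1, "John", "Doe"), (2, "Johny", "Doey"), (3, "Johnathan", "Chang"),
    (4, "longjohn", "silver"), (5, "dijohn", "Doera"), (6, "papajohn", "Do"),
    (7, "ojohnyyy", "Dey"), (8, "jerry", "seinfield"), (9, "kramer", "dicosta"),
]


def matches_any(query, users):
    """Ids of users sharing at least one gram with any query word (OR)."""
    query_grams = set()
    for word in query.split():
        query_grams |= ngrams(word)
    hits = []
    for user_id, first, last in users:
        if query_grams & (ngrams(first) | ngrams(last)):
            hits.append(user_id)
    return hits


print(matches_any("john doe", USERS))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In this model record 8 matches through the "j" in "jerry" and record 9 through the "d" and "o" in "dicosta", so every record is a hit.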

clint

--


(hemant pahilwani) #5

Thanks Clint and David. I tried changing the min and max gram sizes, but
the results are still not what I expected. Maybe I need to change the
analyzer or the way I am querying. What would be the best way to query? Is
the query_string query appropriate for my situation?

--


(hemant pahilwani) #6

Fuzzy search seems to be close to what is required, but I am not able to
find a way to run a fuzzy query on multiple fields. Is it possible? I tried
doing it through a bool query, but it didn't give me any results. Any ideas
on how to run a fuzzy (not fuzzy_like_this) query on multiple fields?

--


(Chris Male) #7

One thing to consider about fuzzy search is that it is extremely expensive
performance-wise. Having the leading wildcard is going to prevent any real
optimization as well.

I wonder what the impact of changing the default search operator to AND
would be. For 'john' that would mean the matched document would need to
have 'j', 'o', 'h', 'n', 'jo', 'oh', 'hn'. Going back to your list of
matched documents, this would still match the first 7 but wouldn't match 8
and 9, giving you the behaviour you want.
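
A rough Python model of the AND idea, run against the sample records from this thread. The assumptions are loud ones: `ngrams` and `matches_all` are hypothetical helpers, gram sizes 1 to 2 are assumed, and a query word is treated as matching when a single field contains every one of its grams. The request body at the end assumes the `default_operator` option of `query_string`; it is only printed, not sent.

```python
import json


def ngrams(text, min_gram=1, max_gram=2):
    """Lowercase character n-grams of `text` (hypothetical helper)."""
    text = text.lower()
    return {
        text[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(text) - size + 1)
    }


# (id, firstName, lastName) taken from the sample data in this thread.
USERS = [
    (1, "John", "Doe"), (2, "Johny", "Doey"), (3, "Johnathan", "Chang"),
    (4, "longjohn", "silver"), (5, "dijohn", "Doera"), (6, "papajohn", "Do"),
    (7, "ojohnyyy", "Dey"), (8, "jerry", "seinfield"), (9, "kramer", "dicosta"),
]


def matches_all(query, users):
    """Ids of users where every query word has all its grams in one field."""
    hits = []
    for user_id, first, last in users:
        fields = [ngrams(first), ngrams(last)]
        if all(any(ngrams(word) <= field for field in fields)
               for word in query.split()):
            hits.append(user_id)
    return hits


print(matches_all("john", USERS))      # [1, 2, 3, 4, 5, 6, 7]
print(matches_all("john doe", USERS))  # [1, 2, 5]

# Request body with the operator switched to AND (assumed option name).
body = {"query": {"query_string": {
    "fields": ["firstName", "lastName"],
    "query": "john doe",
    "use_dis_max": True,
    "default_operator": "AND",
}}}
print(json.dumps(body))
```

In this model 'john' alone keeps records 1 to 7 and drops 8 and 9, while the full 'john doe' narrows further to records where some field also contains every gram of 'doe'. Whether the real query parser combines the grams this way is worth checking against a live index.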

--

