Fuzzy matching against multiple fields


(pcdinh) #1

Hi all,

I want to make a fuzzy query that is expected to work on multiple
fields. What is the correct way to do it?

I tried the following queries (https://gist.github.com/923848), but they
do not work.

Thanks

pcdinh


(Clinton Gormley) #2

Hi pcdinh

The easiest way to do it is to use the ~ operator in the query string,
for instance:

curl -XGET 'http://127.0.0.1:9200/accdev/exp/_search?pretty=1' -d '
{
   "query" : {
      "query_string" : {
         "query" : "username:pcdinh~0.5 fullname_idx:pcdinh~0.1^3",
         "fuzzy_prefix_length" : 1
      }
   }
}
'

For more on the Lucene query string syntax, see:
http://lucene.apache.org/java/3_0_0/queryparsersyntax.html

The query string is a bit less configurable than a fuzzy query (for
instance, you can't specify a per-field fuzzy_prefix_length).
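
As a rough illustration of how such a query string is put together, here is a small Python sketch that builds the same request body from a list of per-field settings. The helper names (fuzzy_term, fuzzy_query_string) are made up for this example and are not part of any Elasticsearch client; the sketch also assumes the values contain no characters that need escaping in Lucene syntax.

```python
import json


def fuzzy_term(field, value, min_similarity, boost=None):
    """Render one `field:value~similarity[^boost]` clause (hypothetical helper)."""
    clause = f"{field}:{value}~{min_similarity}"
    if boost is not None:
        clause += f"^{boost}"
    return clause


def fuzzy_query_string(terms, fuzzy_prefix_length=0):
    """Build a query_string request body from a list of term settings.

    fuzzy_prefix_length applies to every clause; it cannot be set per field.
    """
    query = " ".join(fuzzy_term(**term) for term in terms)
    return {
        "query": {
            "query_string": {
                "query": query,
                "fuzzy_prefix_length": fuzzy_prefix_length,
            }
        }
    }


body = fuzzy_query_string(
    [
        {"field": "username", "value": "pcdinh", "min_similarity": 0.5},
        {"field": "fullname_idx", "value": "pcdinh", "min_similarity": 0.1, "boost": 3},
    ],
    fuzzy_prefix_length=1,
)
print(json.dumps(body, indent=3))
```

The printed JSON can be passed to curl with -d, as in the example above.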

In order to use the fuzzy query against two different fields, you need
to use two fuzzy queries:

        {
           "fuzzy" : {
              "username" : {
                 "min_similarity" : 0.5,
                 "boost" : 1,
                 "value" : "pcdin",
                 "prefix_length" : 0
              }
           }
        },
        {
           "fuzzy" : {
              "fullname_idx" : {
                 "min_similarity" : 0.1,
                 "boost" : 3,
                 "value" : "pcdinh",
                 "prefix_length" : 1
              }
           }
        }

But you need to combine these two queries somehow. Your options are to
wrap them in either a 'bool' query, or a 'dis_max' query:

bool: http://www.elasticsearch.org/guide/reference/query-dsl/bool-query.html

curl -XGET 'http://127.0.0.1:9200/accdev/exp/_search?pretty=1' -d '
{
   "query" : {
      "bool" : {
         "should" : [
            FUZZY QUERIES HERE
         ]
      }
   }
}
'

dis_max: http://www.elasticsearch.org/guide/reference/query-dsl/dis-max-query.html

curl -XGET 'http://127.0.0.1:9200/accdev/exp/_search?pretty=1' -d '
{
   "query" : {
      "dis_max" : {
         "queries" : [
            FUZZY QUERIES HERE
         ],
         "tie_breaker" : 0.7
      }
   }
}
'

The difference between bool and dis_max is how each combines the scores
if both queries match. The bool query adds both their scores together.
The dis_max query takes the better score and adds tie_breaker times the
score of the other matching query.
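
To make the difference concrete, here is a simplified Python sketch of the two combination rules. It is only a model of the arithmetic: the function names are invented for this example, and real Lucene scoring also applies factors (such as normalization) that are ignored here.

```python
def bool_should_score(scores):
    """Simplified bool/should rule: add up the scores of the matching clauses."""
    return sum(scores)


def dis_max_score(scores, tie_breaker=0.0):
    """Simplified dis_max rule: best score plus tie_breaker times the rest."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


# Hypothetical per-clause scores for one document matching both fuzzy queries.
clause_scores = [2.0, 1.0]

print("bool:   ", bool_should_score(clause_scores))
print("dis_max:", dis_max_score(clause_scores, tie_breaker=0.7))
```

With a tie_breaker of 0, dis_max reduces to the plain maximum; with a tie_breaker of 1, it gives the same sum as this simplified bool rule.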

In your case, where you're searching for a matching user, it would
probably make sense to use the bool query, so the full query would look
like this:

curl -XGET 'http://127.0.0.1:9200/accdev/exp/_search?pretty=1' -d '
{
   "query" : {
      "bool" : {
         "should" : [
            {
               "fuzzy" : {
                  "username" : {
                     "min_similarity" : 0.5,
                     "boost" : 1,
                     "value" : "pcdin",
                     "prefix_length" : 0
                  }
               }
            },
            {
               "fuzzy" : {
                  "fullname_idx" : {
                     "min_similarity" : 0.1,
                     "boost" : 3,
                     "value" : "pcdinh",
                     "prefix_length" : 1
                  }
               }
            }
         ]
      }
   }
}
'
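
If you build these requests from code rather than by hand, something like the following Python sketch could generate either wrapper from the same list of fuzzy clauses. The helpers fuzzy_clause and combine are hypothetical names used only for this example; the JSON they produce mirrors the bodies shown above.

```python
import json


def fuzzy_clause(field, value, min_similarity, boost=1, prefix_length=0):
    """Build one per-field fuzzy query (hypothetical helper)."""
    return {
        "fuzzy": {
            field: {
                "min_similarity": min_similarity,
                "boost": boost,
                "value": value,
                "prefix_length": prefix_length,
            }
        }
    }


def combine(clauses, mode="bool", tie_breaker=0.7):
    """Wrap the clauses in a bool/should or a dis_max query."""
    if mode == "bool":
        return {"query": {"bool": {"should": clauses}}}
    if mode == "dis_max":
        return {"query": {"dis_max": {"queries": clauses, "tie_breaker": tie_breaker}}}
    raise ValueError(f"unknown mode: {mode!r}")


clauses = [
    fuzzy_clause("username", "pcdin", 0.5, boost=1, prefix_length=0),
    fuzzy_clause("fullname_idx", "pcdinh", 0.1, boost=3, prefix_length=1),
]

print(json.dumps(combine(clauses, mode="bool"), indent=3))
```

Switching mode to "dis_max" yields the other wrapper without touching the per-field settings.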

Alternatively, unless you have disabled it, the values for both username
and fullname_idx are also indexed in the _all field. So, (sacrificing
the per-field customizations) you could just do:

curl -XGET 'http://127.0.0.1:9200/accdev/exp/_search?pretty=1' -d '
{
   "query" : {
      "fuzzy" : {
         "_all" : {
            "min_similarity" : 0.1,
            "value" : "pcidnh"
         }
      }
   }
}
'

Or even:

curl -XGET 'http://127.0.0.1:9200/accdev/exp/_search?pretty=1' -d '
{
   "query" : {
      "field" : {
         "_all" : "pcidnh~0.1"
      }
   }
}
'

clint


(pcdinh) #3

Hi Clinton,

Thanks a lot for your excellent answer. I will use the Lucene syntax to
reduce the request bandwidth to the ES server.

Also, it would be great to see your answer become part of the ES
documentation.

Regards,

pcdinh

On 17 April, 16:31, Clinton Gormley clin...@iannounce.co.uk
wrote:

