The relation between the number of indexed documents, max_expansions and prefix_length in text_phrase_prefix

Hi all,

I have an issue with ES 0.19.3 regarding the text_phrase_prefix query. When
the number of documents indexed in ES is small, the following query
works perfectly (I type "New Y", not "New York" or "New Yo"):

curl -X DELETE http://es1:9200/cities
curl -X POST "http://es1:9200/cities/city" -d '{ "city" : "New York" }'
curl -X POST "http://es1:9200/cities/city" -d '{ "city" : "North New York" }'
curl -X POST "http://es1:9200/cities/city" -d '{ "city" : "East New York" }'

curl -XGET http://es1:9200/cities/city/_search?pretty=true -d '
{
  "fields" : [ "city" ],
  "query" : {
    "text_phrase_prefix" : {
      "city" : {
        "query" : "New Y",
        "max_expansions" : 2,
        "prefix_length" : 2
      }
    }
  },
  "from" : 0,
  "size" : 20
}'

returns all cities or areas that have "New York" in their names:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.38356602,
    "hits" : [ {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "4ObkgggqS7uou1XLdwOkfA",
      "_score" : 0.38356602,
      "fields" : {
        "city" : "New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "CZutMgvwSfa8O79Vajkshg",
      "_score" : 0.30685282,
      "fields" : {
        "city" : "North New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "ZGA3gno9QnOIBg2MxxsPbg",
      "_score" : 0.30685282,
      "fields" : {
        "city" : "East New York"
      }
    } ]
  }
}

However, when the number of indexed documents grows (more than 30,000
cities, towns, or areas in the US), the above query no longer works.
I need to increase max_expansions to a number greater than 17 (18 or
greater, to be specific) to make it work again. Any number smaller than 18
does not work. If I don't increase max_expansions, I need to use longer
keywords like "New Yo" or "New York":

curl -XGET http://184.72.29.x:9200/cities/city/_search?pretty=true -d '
{
  "fields" : [ "area_label" ],
  "query" : {
    "text_phrase_prefix" : {
      "area_label" : {
        "query" : "New Y",
        "max_expansions" : 18,
        "prefix_length" : 2
      }
    }
  },
  "from" : 0,
  "size" : 20
}'

returns

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 111.86473,
    "hits" : [ {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "195232",
      "_score" : 111.86473,
      "fields" : {
        "area_label" : "New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "46727",
      "_score" : 89.49178,
      "fields" : {
        "area_label" : "North New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "46772",
      "_score" : 89.49178,
      "fields" : {
        "area_label" : "East New York"
      }
    } ]
  }
}

prefix_length does not seem to play any role here: even if I increase
prefix_length to 20, the result stays the same.

I don't understand why 18 is a magic number in this case. I guess there is
a relationship between max_expansions and the number of indexed documents:
as the number of indexed documents increases, I need to increase
max_expansions too, or the above query stops working.

Am I missing something?

Regards,

Dinh

--

Hi Dinh

> However, when the number of indexed documents grows (more than 30,000
> cities, towns, or areas in the US), the above query no longer works.
> I need to increase max_expansions to a number greater than 17 (18 or
> greater, to be specific) to make it work again. Any number smaller
> than 18 does not work. If I don't increase max_expansions, I need to
> use longer keywords like "New Yo" or "New York".

> prefix_length does not seem to play any role here: even if I increase
> prefix_length to 20, the result stays the same.

prefix_length is used by fuzzy queries, not by phrase_prefix.

> I don't understand why 18 is a magic number in this case. I guess
> there is a relationship between max_expansions and the number of
> indexed documents: as the number of indexed documents increases, I
> need to increase max_expansions too, or the above query stops working.

To build the prefix query, Lucene takes the prefix of your last term ("y"
in this case), finds all indexed terms that start with it, sorts them
alphabetically, and adds up to max_expansions of them to the query.

If you've indexed lots more data, then presumably you now have many more
terms between "ya" and "yo" than you had before.
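In other words, with max_expansions=2 only the first two terms in
alphabetical order that start with "y" ever make it into the query. A rough
illustration with a made-up term dictionary (the sample terms here are
hypothetical, not your actual index):

```shell
# Made-up sample of indexed terms starting with "y". Lucene sorts the
# matching terms alphabetically and keeps only the first max_expansions
# of them -- here, 2.
printf '%s\n' yankton yazoo yonkers york yorktown ypsilanti \
  | sort | head -n 2
# Only "yankton" and "yazoo" are expanded, so "york" is never added to
# the query and "New York" cannot match.
```

Raising max_expansions just pushes that cut-off further down the sorted
term list until "york" happens to fall inside it, which is why the
threshold moves as the index grows.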

If you want to do partial matching of words, a better idea is to index
your data with ngrams or edge-ngrams up front. It is more efficient at
search time than using a prefix query.
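For example, something along these lines (a sketch against ES 0.19-era
syntax; the analyzer and filter names are my own inventions, and you would
adjust min_gram/max_gram to your data) defines an edge-ngram analyzer at
index-creation time and applies it to the city field at index time only,
so the query string itself is not ngrammed:

```shell
# Sketch: create the index so that "New York" is indexed as
# "n", "ne", "new", "y", "yo", "yor", "york", ... A plain text query
# for "new y" then matches directly, with no prefix expansion at all.
curl -X PUT "http://es1:9200/cities" -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "edge_ngram_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "edge_ngram_filter" ]
        }
      },
      "filter" : {
        "edge_ngram_filter" : {
          "type" : "edgeNGram",
          "min_gram" : 1,
          "max_gram" : 20
        }
      }
    }
  },
  "mappings" : {
    "city" : {
      "properties" : {
        "city" : {
          "type" : "string",
          "index_analyzer" : "edge_ngram_analyzer",
          "search_analyzer" : "standard"
        }
      }
    }
  }
}'
```

The key design point is the asymmetric analyzers: ngrams are produced at
index time, while the search side uses the standard analyzer, so a query
term like "y" matches the stored edge-ngrams without exploding the query.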

clint

--