The relation between the number of indexed documents, max_expansions and prefix_length in text_phrase_prefix

Hi all,

I have an issue with the text_phrase_prefix query in ES 0.19.3. When the
number of documents indexed in ES is small, the following query
works perfectly (note that I type "New Y", not "New York" or "New Yo"):

curl -X DELETE http://es1:9200/cities
curl -X POST "http://es1:9200/cities/city" -d '{ "city" : "New York" }'
curl -X POST "http://es1:9200/cities/city" -d '{ "city" : "North New York" }'
curl -X POST "http://es1:9200/cities/city" -d '{ "city" : "East New York" }'

curl -XGET 'http://es1:9200/cities/city/_search?pretty=true' -d '{
  "fields": ["city"],
  "query": {
    "text_phrase_prefix": {
      "city": {
        "query": "New Y",
        "max_expansions": 2,
        "prefix_length": 2
      }
    }
  }
}'

It returns all cities or areas that have "New York" in their names:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.38356602,
    "hits" : [ {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "4ObkgggqS7uou1XLdwOkfA",
      "_score" : 0.38356602,
      "fields" : {
        "city" : "New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "CZutMgvwSfa8O79Vajkshg",
      "_score" : 0.30685282,
      "fields" : {
        "city" : "North New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "ZGA3gno9QnOIBg2MxxsPbg",
      "_score" : 0.30685282,
      "fields" : {
        "city" : "East New York"
      }
    } ]
  }
}

However, when the number of indexed documents grows (more than 30,000
cities, towns or areas in the US), the above query does not work any more.
I need to increase max_expansions to at least 18 to make it work again;
anything lower does not work. If I don't increase max_expansions, I
need to use keywords like "New Yo" or "New York".

curl -XGET 'http://184.72.29.x:9200/cities/city/_search?pretty=true' -d '{
  "fields": ["area_label"],
  "query": {
    "text_phrase_prefix": {
      "area_label": {
        "query": "New Y",
        "max_expansions": 18,
        "prefix_length": 2
      }
    }
  }
}'

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 111.86473,
    "hits" : [ {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "195232",
      "_score" : 111.86473,
      "fields" : {
        "area_label" : "New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "46727",
      "_score" : 89.49178,
      "fields" : {
        "area_label" : "North New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "46772",
      "_score" : 89.49178,
      "fields" : {
        "area_label" : "East New York"
      }
    } ]
  }
}

prefix_length does not play any role in this case. Even if I increase the
value of prefix_length to 20, the result is still the same.

I don't understand why 18 is the magic number in this case. I guess that
there is a relationship between max_expansions and the number of
indexed documents, so when the number of indexed documents increases, I need
to increase max_expansions too or the above query stops working.

Am I missing something?




Hi Dinh

> However, when the number of indexed documents grows (more than 30,000
> cities, towns or areas in the US), the above query does not work any more.
> I need to increase max_expansions to at least 18 to make it work again;
> anything lower does not work. If I don't increase max_expansions,
> I need to use keywords like "New Yo" or "New York".
>
> prefix_length does not play any role in this case. Even if I increase the
> value of prefix_length to 20, the result is still the same.

prefix_length is for fuzzy queries, not phrase_prefix.

> I don't understand why 18 is the magic number in this case. I guess
> that there is a relationship between max_expansions and the number of
> indexed documents, so when the number of indexed documents increases, I
> need to increase max_expansions too or the above query stops working.

To build a prefix query, Elasticsearch looks for all terms starting with
your prefix 'y' and adds each term to your query, up to max_expansions. The
terms are taken in alphabetical order, so only the first max_expansions
matching terms are used.

If you've indexed a lot more data, then presumably you have a lot more
terms between "ya" and "yo" than you had before, and "york" no longer
makes the cut.
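
To make the effect concrete, here is a rough Python sketch of the expansion step described above. It is only a simulation, not Elasticsearch code: the function name expand_prefix and the extra terms in large_index are made up for illustration, and it assumes the analyzer has lowercased the indexed terms.

```python
def expand_prefix(terms, prefix, max_expansions):
    """Return the first max_expansions distinct terms starting with prefix,
    taken in alphabetical order (a simulation of the behaviour described above)."""
    matching = sorted(t for t in set(terms) if t.startswith(prefix))
    return matching[:max_expansions]


# Terms from the three sample documents (assumed lowercased by the analyzer).
small_index = ["new", "york", "north", "east"]

# Hypothetical extra terms standing in for a much larger index.
large_index = small_index + [
    "yadkin", "yakima", "yale", "yankton", "yates", "yellow", "yuma",
]

print(expand_prefix(small_index, "y", 2))   # "york" is the only candidate
print(expand_prefix(large_index, "y", 2))   # "york" is cut off by earlier terms
print(expand_prefix(large_index, "y", 7))   # limit is now large enough to reach "york"
print(expand_prefix(large_index, "yo", 2))  # a longer prefix narrows the candidates
```

With the small index, "york" is the only term starting with "y", so a limit of 2 is plenty. With the larger index, terms such as "yadkin" and "yakima" sort ahead of it and use up the limit. A threshold of 18 in the real index would be consistent with about 17 terms sorting before "york", although that is a guess from the numbers in this thread.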

If you want to do partial matching of words, then a better idea is to
use ngrams or edge-ngrams to index your data up front. It is more
efficient at search time than using a prefix query.
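
As a rough illustration of what edge-ngram indexing produces, the Python sketch below generates the prefixes that would be stored for each word. The helper names edge_ngrams and index_terms and the min_gram/max_gram defaults are assumptions for this example; in Elasticsearch itself this would be configured through an analyzer rather than written by hand.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the leading substrings of token from min_gram to max_gram characters."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(text, min_gram=1, max_gram=10):
    """Lowercase, split on whitespace, and collect the edge n-grams of every word."""
    terms = set()
    for token in text.lower().split():
        terms.update(edge_ngrams(token, min_gram, max_gram))
    return terms


print(edge_ngrams("york"))
terms = index_terms("North New York")
# Every typed prefix is already an indexed term, so no expansion is needed.
print("y" in terms, "yo" in terms, "york" in terms)
```

Because "y", "yo", "yor" and "york" are all stored at index time, a search for "New Y" can match them as ordinary terms and does not depend on max_expansions. The trade-off is a larger index.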

