The relation between the number of indexed documents, max_expansions and prefix_length in text_phrase_prefix

Hi all,

I have an issue with ES 0.19.3 regarding the text_phrase_prefix query. When
the number of documents indexed in ES is small, the following query
works perfectly (I type "New Y", not "New York" or "New Yo"):

curl -X DELETE http://es1:9200/cities
curl -X POST "http://es1:9200/cities/city" -d '{ "city" : "New York" }'
curl -X POST "http://es1:9200/cities/city" -d '{ "city" : "North New York" }'
curl -X POST "http://es1:9200/cities/city" -d '{ "city" : "East New York" }'

curl -XGET http://es1:9200/cities/city/_search?pretty=true -d '
{
  "fields" : [ "city" ],
  "query" : {
    "text_phrase_prefix" : {
      "city" : {
        "query" : "New Y",
        "max_expansions" : 2,
        "prefix_length" : 2
      }
    }
  },
  "from" : 0,
  "size" : 20
}'

returns all cities or areas that have "New York" in their names:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.38356602,
    "hits" : [ {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "4ObkgggqS7uou1XLdwOkfA",
      "_score" : 0.38356602,
      "fields" : {
        "city" : "New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "CZutMgvwSfa8O79Vajkshg",
      "_score" : 0.30685282,
      "fields" : {
        "city" : "North New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "ZGA3gno9QnOIBg2MxxsPbg",
      "_score" : 0.30685282,
      "fields" : {
        "city" : "East New York"
      }
    } ]
  }
}

However, when the number of indexed documents grows (more than 30,000
cities, towns, or areas in the US), the above query no longer works.
I need to increase max_expansions to a number greater than 17 (18 or
greater, to be specific) to make it work again. Any number smaller than 18
does not work. If I don't increase max_expansions, I need to use longer
keywords like "New Yo" or "New York":

curl -XGET http://184.72.29.x:9200/cities/city/_search?pretty=true -d '
{
  "fields" : [ "area_label" ],
  "query" : {
    "text_phrase_prefix" : {
      "area_label" : {
        "query" : "New Y",
        "max_expansions" : 18,
        "prefix_length" : 2
      }
    }
  },
  "from" : 0,
  "size" : 20
}'

returns

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 111.86473,
    "hits" : [ {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "195232",
      "_score" : 111.86473,
      "fields" : {
        "area_label" : "New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "46727",
      "_score" : 89.49178,
      "fields" : {
        "area_label" : "North New York"
      }
    }, {
      "_index" : "cities",
      "_type" : "city",
      "_id" : "46772",
      "_score" : 89.49178,
      "fields" : {
        "area_label" : "East New York"
      }
    } ]
  }
}

prefix_length does not seem to play any role here: even if I increase
prefix_length to 20, the result stays the same.

I don't understand why 18 is a magic number in this case. I guess there is
a relationship between max_expansions and the number of indexed documents:
as the number of indexed documents increases, I need to increase
max_expansions too, or the above query stops working.

Am I missing something?

Regards,

Dinh

--

Hi Dinh

> However, when the number of indexed documents grows (more than 30,000
> cities, towns, or areas in the US), the above query no longer works.
> I need to increase max_expansions to a number greater than 17 (18 or
> greater, to be specific) to make it work again. Any number smaller
> than 18 does not work. If I don't increase max_expansions, I need to
> use longer keywords like "New Yo" or "New York".

> prefix_length does not seem to play any role here: even if I increase
> prefix_length to 20, the result stays the same.

prefix_length is used by fuzzy queries, not by phrase_prefix.

> I don't understand why 18 is a magic number in this case. I guess
> there is a relationship between max_expansions and the number of
> indexed documents: as the number of indexed documents increases, I
> need to increase max_expansions too, or the above query stops working.

To build the prefix query, Lucene takes the prefix of your last term ("y"
in this case), finds all indexed terms that start with it, sorts them
alphabetically, and adds up to max_expansions of them to the query.

If you've indexed lots more data, then presumably you now have many more
terms between "ya" and "yo" than you had before.
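In other words, with max_expansions=2 only the first two terms in
alphabetical order that start with "y" ever make it into the query. A rough
illustration with a made-up term dictionary (the sample terms here are
hypothetical, not your actual index):

```shell
# Made-up sample of indexed terms starting with "y". Lucene sorts the
# matching terms alphabetically and keeps only the first max_expansions
# of them -- here, 2.
printf '%s\n' yankton yazoo yonkers york yorktown ypsilanti \
  | sort | head -n 2
# Only "yankton" and "yazoo" are expanded, so "york" is never added to
# the query and "New York" cannot match.
```

Raising max_expansions just pushes that cut-off further down the sorted
term list until "york" happens to fall inside it, which is why the
threshold moves as the index grows.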

If you want to do partial matching of words, a better idea is to index
your data with ngrams or edge-ngrams up front. It is more efficient at
search time than using a prefix query.
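For example, something along these lines (a sketch against ES 0.19-era
syntax; the analyzer and filter names are my own inventions, and you would
adjust min_gram/max_gram to your data) defines an edge-ngram analyzer at
index-creation time and applies it to the city field at index time only,
so the query string itself is not ngrammed:

```shell
# Sketch: create the index so that "New York" is indexed as
# "n", "ne", "new", "y", "yo", "yor", "york", ... A plain text query
# for "new y" then matches directly, with no prefix expansion at all.
curl -X PUT "http://es1:9200/cities" -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "edge_ngram_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "edge_ngram_filter" ]
        }
      },
      "filter" : {
        "edge_ngram_filter" : {
          "type" : "edgeNGram",
          "min_gram" : 1,
          "max_gram" : 20
        }
      }
    }
  },
  "mappings" : {
    "city" : {
      "properties" : {
        "city" : {
          "type" : "string",
          "index_analyzer" : "edge_ngram_analyzer",
          "search_analyzer" : "standard"
        }
      }
    }
  }
}'
```

The key design point is the asymmetric analyzers: ngrams are produced at
index time, while the search side uses the standard analyzer, so a query
term like "y" matches the stored edge-ngrams without exploding the query.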

clint

--