GeoNames, Autocomplete and boost


(Info Cascade) #1

Hi. I'm trying to improve autocomplete search results on a GeoNames Cities
index.
I have been using django-haystack, but have run into issues there. I may
need to replace it, or bypass it. But my question here pertains to
indexing and querying with autocomplete using multiple fields.

Users expect to be able to use two-letter abbreviations for states to
narrow their city choices. For example,
"San Francisco, CA" and "New York, NY" should have the cities you'd expect
at the top of the list.
However, that is not the case, and I think the two queries fail for
different reasons. You can see the results below.

It turns out that there are a lot of San Franciscos in the world!
Searching for "San Francisco CA" retrieves

San Francisco, Caraga, 13, PH 5.5191193
San Francisco, Caraga, 13, PH 5.5163627
San Francisco, Calabarzon, 40, PH 5.4498897
San Francisco, Calabarzon, 40, PH 5.281434
San Francisco, Caraga, 13, PH 5.281434
San Francisco, California, CA, US 5.2123656
South San Francisco, California, CA, US 4.3138
San Francisco (El Calvito), Chiapas, 05, MX 4.137272
San Francisco, Baja California Sur, 03, MX 4.137272
San Francisco (Baños de Agua Caliente), Guanajuato, 11, MX 3.3008962

I would like to boost the state (region_code) value so that San Francisco
and South San Francisco are at the top.

For "New York NY" I get

Nyack, New York, US 3.0575132
West Nyack, New York, US 2.670291
South Nyack, New York, US 2.5124028
Upper Nyack, New York, US 2.5124028

Instead of what I want, which is "New York City, New York, US".

The autocomplete field is an EdgeNGram field called "content_auto". It
currently has the following format, which is what I want to return:
"CityName, RegionName, CountryCode."

So I think what I want to do in both cases is boost results if there is a
match on the region_code field, but not display the region_code field in
the results.
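
To make the idea concrete, here is a minimal sketch of what I have in mind, written as a Python dict so it could be sent without haystack. It is an assumption on my part, not something verified against this index: the helper name boosted_city_query and the boost value 3.0 are hypothetical, and the "_source" filter assumes Elasticsearch 1.0 or later.

```python
def boosted_city_query(user_text, region_boost=3.0):
    """Build a search body that scores region_code matches higher.

    Hypothetical helper: the name and default boost are illustrative.
    """
    return {
        "query": {
            "bool": {
                # Required clause: at least one token of the input must
                # match the autocomplete field (match defaults to OR).
                "must": {"match": {"content_auto": user_text}},
                # Optional clause: documents whose region_code matches
                # any token of the input (e.g. "CA") score higher.
                "should": {
                    "match": {
                        "region_code": {
                            "query": user_text,
                            "boost": region_boost,
                        }
                    }
                },
            }
        },
        # Return only the display string, so region_code is not shown.
        "_source": ["content_auto"],
    }
```

The should clause only affects ranking, so cities outside the region would still be returned, just lower in the list.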

The type of the search is currently query_string, which is what haystack
uses. If there is some way to make that work, then that would be good.
However, I'm afraid it is limiting what I'm able to do.

I did some experiments. If I query directly with curl for San Francisco
using

{
"query":{
"multi_match":{
"query": "San Francisco CA",
"type": "cross_fields",
"fields": ["content_auto", "region_code^3"]
}
}
}
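
If I end up bypassing haystack, the same multi_match body could be sent from Python with only the standard library. This is a rough sketch under assumptions: build_search_request is a hypothetical helper, the URL matches my local setup (localhost:9200, index "cities", type "modelresult"), and POST is used in place of curl's GET with a body.

```python
import json
import urllib.request


def build_search_request(user_text, base_url="http://localhost:9200"):
    """Prepare (but do not send) the cross_fields search request."""
    body = {
        "query": {
            "multi_match": {
                "query": user_text,
                "type": "cross_fields",
                # region_code is weighted three times content_auto.
                "fields": ["content_auto", "region_code^3"],
            }
        }
    }
    return urllib.request.Request(
        base_url + "/cities/modelresult/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it requires a running node, so it is left commented out:
# with urllib.request.urlopen(build_search_request("New York NY")) as resp:
#     hits = json.load(resp)["hits"]["hits"]
```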

I get a result I'm satisfied with. However, the same query with "New
York NY" puts the city sixth! I also tried putting the region_code in the
content_auto string and boosting the region_code field.
The following query_string fragment works for SF, but it ranks New York
City third, and it requires knowing the region_code in advance, which I
don't: I would have to pick two-letter combinations out of the user's input.
"default_field": "text",
"default_operator": "OR",
"query": "(content_auto:(san) AND content_auto:(francisco)) CA^1.5"
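
Picking out the two-letter combination could be sketched like this in Python. It is only a guess at an approach: split_region_code is a hypothetical helper, and it treats any trailing two-letter token as a region code, so an input that merely ends in a short word (for example "de") would be misread.

```python
import re

# City part, then commas/spaces, then exactly two letters at the end.
_TRAILING_CODE = re.compile(r"^(.*?[^\s,])[,\s]+([A-Za-z]{2})$")


def split_region_code(user_text):
    """Split "City, ST" into (city, code); code is None if absent."""
    text = user_text.strip()
    match = _TRAILING_CODE.match(text)
    if match is None:
        return text, None
    # Upper-case the code so "ca" and "CA" are treated alike.
    return match.group(1), match.group(2).upper()
```

The city part could then go into the content_auto clause and the code into a boosted region_code clause.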

It would really help if someone could help me limit my own queries about
how ElasticSearch works, so that I can focus on the best approach!

Thanks in advance for your help :)

curl 'localhost:9200/cities/_mapping?pretty'
{
"cities" : {
"mappings" : {
"modelresult" : {
"_boost" : {
"name" : "boost",
"null_value" : 1.0
},
"properties" : {
"content_auto" : {
"type" : "string",
"analyzer" : "edgengram_analyzer"
},
"django_ct" : {
"type" : "string",
"index" : "not_analyzed",
"include_in_all" : false
},
"django_id" : {
"type" : "string",
"index" : "not_analyzed",
"include_in_all" : false
},
"id" : {
"type" : "string"
},
"location" : {
"type" : "geo_point"
},
"region_code" : {
"type" : "string",
"analyzer" : "snowball"
},
"text" : {
"type" : "string",
"analyzer" : "snowball"
}
}
}
}
}
}

NY example:
curl -XGET 'http://localhost:9200/cities/modelresult/_search?pretty' -d '{
"from": 0,
"query": {
"filtered": {
"filter": {
"terms": {
"django_ct": [
"cities.city"
]
}
},
"query": {
"query_string": {
"analyze_wildcard": true,
"auto_generate_phrase_queries": true,
"default_field": "text",
"default_operator": "AND",
"query": "(content_auto:(new) AND content_auto:(york,) AND content_auto:(ny))"
}
}
}
},
"size": 10,
"sort": [
{
"_score": {
"order": "desc"
}
}
]
}'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 3.0575132,
"hits" : [ {
"_index" : "cities",
"_type" : "modelresult",
"_id" : "cities.city.5129433",
"_score" : 3.0575132,
"_source":{"django_id": "5129433", "region_code": "NY", "text": "Nyack\nNew York\nNY\nUnited States\nUS\n", "django_ct": "cities.city", "location": "41.09065,-73.91791", "content_auto": "Nyack, New York, US", "id": "cities.city.5129433"}
}, {
"_index" : "cities",
"_type" : "modelresult",
"_id" : "cities.city.5143946",
"_score" : 2.670291,
"_source":{"django_id": "5143946", "region_code": "NY", "text": "West Nyack\nNew York\nNY\nUnited States\nUS\n", "django_ct": "cities.city", "location": "41.09648,-73.97292", "content_auto": "West Nyack, New York, US", "id": "cities.city.5143946"}
}, {
"_index" : "cities",
"_type" : "modelresult",
"_id" : "cities.city.5138940",
"_score" : 2.5124028,
"_source":{"django_id": "5138940", "region_code": "NY", "text": "South Nyack\nNew York\nNY\nUnited States\nUS\n", "django_ct": "cities.city", "location": "41.08315,-73.92014", "content_auto": "South Nyack, New York, US", "id": "cities.city.5138940"}
}, {
"_index" : "cities",
"_type" : "modelresult",
"_id" : "cities.city.5142011",
"_score" : 2.5124028,
"_source":{"django_id": "5142011", "region_code": "NY", "text": "Upper Nyack\nNew York\nNY\nUnited States\nUS\n", "django_ct": "cities.city", "location": "41.10704,-73.92014", "content_auto": "Upper Nyack, New York, US", "id": "cities.city.5142011"}
} ]
}
}




(system) #2