Wrestling with analyzer

Hi,

At Go About we are using elasticsearch (0.90.2) for geocoding.
Unfortunately, I am running into a problem I could use some help with.

I use the following analyzer configuration:

index:
  analysis:
    analyzer:
      default:
        alias: [goabout]
        type: custom
        tokenizer: standard
        filter: [lowercase, synonym, standard, asciifolding]
        char_filter: [char_mapper]
      postal_code:
        tokenizer: keyword
        filter: [lowercase]
    tokenizer:
      standard:
        stopwords: []
    filter:
      synonym:
        type: synonym
        synonyms:
          - st => sint
          - den haag => s gravenhage
          - den bosch => s hertogenbosch
          - jp => jan pieterszoon
          - mh => maarten harpertszoon
    char_filter:
      char_mapper:
        type: mapping
        mappings:
          - ij => y

I then index the following document:

$ curl -XPUT http://localhost:9200/geocoder/address/1 -d '{"city":
"'\''s-Gravenhage", "point": {"lat": 52.034608082483366, "lon":
4.266201580347966}, "street": "Wantsnijdersgaarde", "postal_code":
"2542 GN", "housenumber": "573"}'

(We put a mapping first to make sure "point" is a geo_point, but this is
not relevant for this problem.)

The analyzer seems to work correctly:

$ curl -X GET "http://localhost:9200/geocoder/_analyze?pretty=true" -d "Den Haag"
{
"tokens" : [ {
"token" : "s",
"start_offset" : 0,
"end_offset" : 3,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "gravenhage",
"start_offset" : 4,
"end_offset" : 8,
"type" : "SYNONYM",
"position" : 2
} ]
}

The analyzer seems to be used for both indexing and querying, as this query
(which substitutes "y" for "ij") finds the document:

$ curl -X GET "http://localhost:9200/geocoder/_search?q=Wantsnydersgaarde&analyzer=goabout&pretty=true"

{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.095891505,
"hits" : [ {
"_index" : "geocoder",
"_type" : "address",
"_id" : "1",
"_score" : 0.095891505, "_source" : {"city": "'s-Gravenhage",
"point": {"lat": 52.034608082483366, "lon": 4.266201580347966}, "street":
"Wantsnijdersgaarde", "postal_code": "2542 GN", "housenumber": "573"}
} ]
}
}

But this search query does not return results:

$ curl -X GET "http://localhost:9200/geocoder/_search?q=den+haag&analyzer=goabout&pretty=true"

{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}

Something is going on with the synonym filter. What can I do to further
debug my problem?

Regards,
Joost

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Can you send your mapping as well? Maybe you don't set the index / search
analyzer for the field "city" in the mapping. If you don't, the synonym
filter won't be used there.

simon

(In reply to Joost Cassee, Wednesday, July 31, 2013 4:19:04 PM UTC+2.)

Hi Simon,

This is the mapping for the address type:

{
"properties": {
"street": { "type": "string" },
"housenumber": { "type": "string" },
"postal_code": { "type": "string", "analyzer": "postal_code" },
"city": { "type": "string" },
"point": { "type": "geo_point" }
}
}

I expected the city field to be analyzed by the default analyzer,
which is the one I configured, right?

Regards,
Joost

(In reply to simonw simon.willnauer@elasticsearch.com, 2013/8/2.)
--
Joost Cassee
http://joost.cassee.net


After some digging I found a way to show the indexed terms, and the
analyzer has performed the "ij" -> "y" mapping:

$ curl 'http://localhost:9200/geocoder/_search?pretty=true' -d '{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "terms": {
      "script": "doc[field].values",
      "params": {
        "field": "_all"
      }
    }
  }
}'
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "geocoder",
"_type" : "address",
"_id" : "1",
"_score" : 1.0,
"fields" : {
"terms" : [ "2542", "573", "gn", "gravenhage", "s",
"wantsnydersgaarde" ]
}
} ]
}
}

Can anything else go wrong with the mapping?

Regards,
Joost

(In reply to Joost Cassee joost@cassee.net, 2013/8/5.)

In GitHub issue 3494, Igor Motov wrote:

In order to find operators, the query string parser first splits the
query string on spaces, and only then passes each part through the
analyzer to produce tokens. Your query is basically equivalent to the
query "den OR haag":

$ curl -XGET "http://localhost:9200/geocoder/_validate/query?q=den+haag&analyzer=goabout&pretty=true&explain=true"
{
"valid" : true,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"explanations" : [ {
"index" : "test-idx",
"valid" : true,
"explanation" : "_all:den _all:haag"
} ]
}
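
A toy Python sketch may make the difference concrete. It is not
Elasticsearch code: analyze() is a hypothetical stand-in for the goabout
analyzer that only lowercases, splits on whitespace and applies a few of
the synonym rules from the configuration above (the "ij => y" mapping and
asciifolding are left out). It only illustrates the order of "split" and
"analyze" described above.

```python
# Toy model of the synonym step of the "goabout" analyzer (an assumption,
# not the real Lucene implementation).
SYNONYMS = {
    ("st",): ["sint"],
    ("den", "haag"): ["s", "gravenhage"],
    ("den", "bosch"): ["s", "hertogenbosch"],
}


def analyze(text):
    """Lowercase, split on whitespace, then rewrite matching token runs."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        for rule, replacement in SYNONYMS.items():
            if tuple(tokens[i:i + len(rule)]) == rule:
                out.extend(replacement)
                i += len(rule)
                break
        else:
            # No synonym rule starts at this position: keep the token.
            out.append(tokens[i])
            i += 1
    return out


def query_string_terms(query):
    """Split on whitespace first, then analyze each part on its own."""
    terms = []
    for part in query.split():
        terms.extend(analyze(part))
    return terms


def whole_text_terms(query):
    """Analyze the whole text in one pass (as the _analyze call did)."""
    return analyze(query)


print(query_string_terms("den haag"))  # ['den', 'haag']
print(whole_text_terms("den haag"))    # ['s', 'gravenhage']
```

Because each part is analyzed separately, the two-word rule
"den haag => s gravenhage" never sees both words at once, so the terms
"den" and "haag" are searched as they are.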

Thanks for clearing that up!

If you are not going to use any query string query operators,
I would suggest using a match query instead of a query string query.
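
As a rough illustration of that suggestion, the request body for such a
match query could be built like this. The helper build_match_query() is
hypothetical, and searching the "_all" field with the "and" operator is
an assumption made for this example, not something stated in the issue.

```python
import json


def build_match_query(text, field="_all", operator="and"):
    """Return a search body with a single match query.

    The whole text is handed to the match query in one piece, so it is
    analyzed as a unit instead of being split on whitespace beforehand.
    """
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


body = json.dumps(build_match_query("den haag"), indent=2)
print(body)
```

The printed JSON can be sent with curl -d to
http://localhost:9200/geocoder/_search in the same way as the other
request bodies in this thread.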

Actually, I was initially going to use a match_phrase_prefix query,
but the problem is I want to be able to match across fields:

$ curl -XGET "http://localhost:9200/geocoder/_search/?pretty=true" -d '{
"query": {
"match_phrase_prefix": {
"_all": {
"query": "573 wantsn"
}
}
}
}'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}

Also, matching the prefix of a synonym does not work. (For example,
a match_phrase_prefix query for "den haa" gives no hits.)
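
A plausible explanation, sketched in Python under loud assumptions:
analyze() is a hypothetical toy stand-in for the goabout analyzer,
INDEXED_TERMS is copied from the script_fields output earlier in this
thread, and term positions are ignored. The synonym rule only fires on
the complete token "haag", so "den haa" stays as it is and has nothing to
match against.

```python
# Terms of the _all field as reported by the script_fields query.
INDEXED_TERMS = ["2542", "573", "gn", "gravenhage", "s", "wantsnydersgaarde"]

# Only the rule relevant here; a toy model, not the real synonym filter.
SYNONYMS = {("den", "haag"): ["s", "gravenhage"]}


def analyze(text):
    """Lowercase, split on whitespace, then rewrite matching token runs."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        for rule, replacement in SYNONYMS.items():
            if tuple(tokens[i:i + len(rule)]) == rule:
                out.extend(replacement)
                i += len(rule)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def phrase_prefix_check(query):
    """Return (tokens, leading tokens missing from the index,
    indexed terms that start with the last token)."""
    tokens = analyze(query)
    *head, last = tokens
    missing = [t for t in head if t not in INDEXED_TERMS]
    expansions = [t for t in INDEXED_TERMS if t.startswith(last)]
    return tokens, missing, expansions


print(phrase_prefix_check("den haa"))
print(phrase_prefix_check("den haag"))
```

In this model "den haa" produces the tokens "den" and "haa": "den" is not
an indexed term and no indexed term starts with "haa". The full
"den haag" is rewritten to "s gravenhage", both of which are indexed.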

That is why I started using the query_string query. But maybe there is
another technique for doing this.

Regards,
Joost

(In reply to Joost Cassee joost@cassee.net, 2013/8/5.)