Hi,
We're using Elasticsearch with an analyzer that maps the character y to
ij (a char_filter named "char_mapper"), since in Dutch these two are
"somewhat" interchangeable. We're also using a lowercase filter.
This is the configuration:
{
  "analysis": {
    "analyzer": {
      "index": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym_twoway",
          "standard",
          "asciifolding"
        ],
        "char_filter": [
          "char_mapper"
        ]
      },
      "index_prefix": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym_twoway",
          "standard",
          "asciifolding",
          "prefixes"
        ],
        "char_filter": [
          "char_mapper"
        ]
      },
      "search": {
        "alias": [
          "default"
        ],
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonym",
          "synonym_twoway",
          "standard",
          "asciifolding"
        ],
        "char_filter": [
          "char_mapper"
        ]
      },
      "postal_code": {
        "tokenizer": "keyword",
        "filter": [
          "lowercase"
        ]
      }
    },
    "tokenizer": {
      "standard": {
        "stopwords": []
      }
    },
    "filter": {
      "synonym": {
        "type": "synonym",
        "synonyms": [
          "st => sint",
          "jp => jan pieterszoon",
          "mh => maarten harpertszoon"
        ]
      },
      "synonym_twoway": {
        "type": "synonym",
        "synonyms": [
          "den haag, s gravenhage",
          "den bosch, s hertogenbosch"
        ]
      },
      "prefixes": {
        "type": "edgeNGram",
        "side": "front",
        "min_gram": 1,
        "max_gram": 30
      }
    },
    "char_filter": {
      "char_mapper": {
        "type": "mapping",
        "mappings": [
          "y => ij"
        ]
      }
    }
  }
}
When indexing cities, we're using this mapping:
{
  "properties": {
    "city": {
      "type": "multi_field",
      "fields": {
        "city": {
          "type": "string"
        },
        "prefix": {
          "type": "string",
          "boost": 0.5,
          "index_analyzer": "index_prefix"
        }
      }
    },
    "province_code": {
      "type": "string"
    },
    "unique_name": {
      "type": "boolean"
    },
    "point": {
      "type": "geo_point"
    },
    "search_terms": {
      "type": "multi_field",
      "fields": {
        "search_terms": {
          "type": "string"
        },
        "prefix": {
          "boost": 0.5,
          "index_analyzer": "index_prefix",
          "type": "string"
        }
      }
    }
  },
  "search_analyzer": "search",
  "index_analyzer": "index"
}
When we index all the (Dutch) cities from our data source, there are cities
starting with both IJ and Y (for example, these city names exist:
IJssel, IJsselstein, Yerseke and Ysselsteyn). It seems that these
characters are not lowercased before the char mapping is applied.
Querying the index results in:
/top/city/_search?q=ijsselstein -> works, returns the document for IJsselstein
/top/city/_search?q=Ijsselstein -> works, returns the document for IJsselstein
/top/city/_search?q=yerseke -> *doesn't* work, returns nothing
/top/city/_search?q=Yerseke -> *does* work, returns the document for Yerseke
/top/city/_search?q=YsselsteYn -> *doesn't* work, returns nothing
/top/city/_search?q=Ysselsteyn -> *does* work, returns the document for Ysselsteyn
Changing the case of any other letter doesn't affect the results.
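To make the suspected cause concrete, here is a rough Python sketch of the
order we think the analyzer applies: char_filter first, then the tokenizer,
then the lowercase token filter. This is not Elasticsearch code. The function
names are made up for illustration, the tokenizer is a crude stand-in for the
standard tokenizer, and the synonym and asciifolding filters are left out.

```python
import re

# Hypothetical stand-in for the "char_mapper" char_filter ("y => ij").
CHAR_MAPPINGS = {"y": "ij"}


def char_filter(text, mappings=CHAR_MAPPINGS):
    # Runs on the raw text, so an uppercase "Y" is not matched by "y => ij".
    return "".join(mappings.get(ch, ch) for ch in text)


def tokenize(text):
    # Crude approximation of the standard tokenizer: runs of word characters.
    return re.findall(r"\w+", text)


def analyze(text):
    # Assumed order: char_filter -> tokenizer -> lowercase token filter.
    return [token.lower() for token in tokenize(char_filter(text))]


for text in ["Yerseke", "yerseke", "Ysselsteyn", "YsselsteYn"]:
    print(text, "->", analyze(text))
```

Under these assumptions, "Yerseke" is indexed as "yerseke" while the query
"yerseke" becomes "ijerseke", and "Ysselsteyn" is indexed as "ysselsteijn"
while the query "YsselsteYn" becomes "ysselsteyn". That matches the results
listed above.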
I've worked around this issue by adding the mapping "Y => ij", i.e.:
"char_filter": {
  "char_mapper": {
    "type": "mapping",
    "mappings": [
      "y => ij",
      "Y => ij"
    ]
  }
}
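As a quick sanity check of the two-entry mapping, the following Python sketch
(again only a simulation with made-up names, assuming the mapping is applied
to the raw text before lowercasing) shows that all case variants of a name end
up as the same term:

```python
# Hypothetical stand-in for the extended "char_mapper" char_filter.
MAPPINGS = {"y": "ij", "Y": "ij"}


def normalize(text, mappings=MAPPINGS):
    # Apply the character mapping to the raw text, then lowercase,
    # mirroring the assumed char_filter -> lowercase order.
    mapped = "".join(mappings.get(ch, ch) for ch in text)
    return mapped.lower()


for variants in (["Yerseke", "yerseke", "YERSEKE"],
                 ["Ysselsteyn", "YsselsteYn", "ysselsteyn"]):
    terms = {normalize(v) for v in variants}
    # Each group should collapse to a single term.
    print(variants, "->", terms)
```

Terms that were indexed with the old mapping would still be in their old form,
so we assume the documents are reindexed after changing the char_filter.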
This solves the problem, but I'd rather have the lowercase filter applied
before the mapping, or be able to make the order explicit. Is there any
stance on this issue? Or is this intended behaviour?
Regards,
Matthias Hogerheijde