In summary: everything is currently working except the char_filter mapping.
I'm on ElasticSearch 0.19.10 because it works fine for our current
production application, and our usage doesn't require any of the bug fixes
in the change logs for more recent versions.
I've isolated this issue to the configured analyzers and a small collection
of HTTP _analyze requests that easily reproduce the problem. No additional
data or queries should be needed at this point (I believe, anyway).
Here is the example I found at
http://www.elasticsearch.org/guide/reference/index-modules/analysis/mapping-charfilter.html
{
    "index" : {
        "analysis" : {
            "char_filter" : {
                "my_mapping" : {
                    "type" : "mapping",
                    "mappings" : ["ph=>f", "qu=>q"]
                }
            },
            "analyzer" : {
                "custom_with_char_filter" : {
                    "tokenizer" : "standard",
                    "char_filter" : ["my_mapping"]
                }
            }
        }
    }
}
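(For context, I'd expect to apply a settings block like that at
index-creation time; here's a sketch using a hypothetical index name
"test". I haven't verified this exact request:)

$ curl -XPUT 'localhost:9200/test' -d '{
    "settings" : {
        "index" : {
            "analysis" : {
                "char_filter" : {
                    "my_mapping" : {
                        "type" : "mapping",
                        "mappings" : ["ph=>f", "qu=>q"]
                    }
                },
                "analyzer" : {
                    "custom_with_char_filter" : {
                        "tokenizer" : "standard",
                        "char_filter" : ["my_mapping"]
                    }
                }
            }
        }
    }
}'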
Here is what my elasticsearch.yml configuration looks like; note the
Finnish character mappings that are typical when searching Finnish names.
The example above didn't quite cover what I need: a snowball stemming
analyzer with Finnish stemming rules, no stop words, and converting w to v
in the input string before tokenizing. After playing around a little,
here's what works (except for the char_filter):
index:
  analysis:
    char_filter:
      finnish_char_mapping:
        type: mapping
        mappings: [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
    analyzer:
      # Default uses the snowball stemming analyzer with no stop words,
      # with the default language per the JVM:
      default:
        type: snowball
        stopwords: none
      # Per-language analyzers
      english_standard:
        type: standard
        language: English
        stopwords: none
      english_stemming:
        type: snowball
        language: English
        stopwords: none
      finnish_stemming:
        type: snowball
        language: Finnish
        char_filter: [finnish_char_mapping]
        stopwords: none
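(As a quick sanity check that the yml configuration was picked up, hitting
one of the configured analyzers by name should return tokens rather than an
error; "sgen" is the index used in the requests below:)

$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=finnish_stemming&pretty=true' -d 'testi' && echo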
This first analyze operation returns the expected tokens. It analyzes the
text using the built-in standard analyzer, which removes stop words by
default (note that "and" is missing and position 3 is skipped):
$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=standard&pretty=true' -d 'Debby Debbie and Walter' && echo
{
  "tokens" : [ {
    "token" : "debby",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "debbie",
    "start_offset" : 6,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "walter",
    "start_offset" : 17,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
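(For completeness, the configured english_standard analyzer, which has
stopwords: none, can be exercised the same way; I've omitted its output
here:)

$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=english_standard&pretty=true' -d 'Debby Debbie and Walter' && echo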
This also works: it uses the snowball analyzer with the English language,
and stop words are included in the list of tokens as desired:
$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=english_stemming&pretty=true' -d 'Debby Debbie and Walter' && echo
{
  "tokens" : [ {
    "token" : "debbi",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "debbi",
    "start_offset" : 6,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "and",
    "start_offset" : 13,
    "end_offset" : 16,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "walter",
    "start_offset" : 17,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
But this doesn't fully work. It applies the Finnish stemming rules (as far
as I can tell; the tokens differ from those produced by the English
snowball rules), but it does not honor the character mapping: I would have
expected "valter", not "walter", as the last token string. And of course a
search for valter won't match walter, so this analysis token issue is
likely the root cause:
$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=finnish_stemming&pretty=true' -d 'Debby Debbie and Walter' && echo
{
  "tokens" : [ {
    "token" : "deby",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "debie",
    "start_offset" : 6,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "and",
    "start_offset" : 13,
    "end_offset" : 16,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "walter",
    "start_offset" : 17,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
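(One experiment I'm considering, in case the pre-built snowball analyzer
type simply ignores char_filter: rebuild the analyzer as a custom one from
the standard tokenizer, the lowercase token filter, a snowball token
filter, and the same char filter. This is an untested sketch; the names
finnish_stemming_custom and finnish_snowball are mine:)

index:
  analysis:
    char_filter:
      finnish_char_mapping:
        type: mapping
        mappings: [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
    filter:
      finnish_snowball:
        type: snowball
        language: Finnish
    analyzer:
      finnish_stemming_custom:
        type: custom
        tokenizer: standard
        char_filter: [finnish_char_mapping]
        filter: [lowercase, finnish_snowball]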
Separately, I haven't been able to get ElasticSearch to define both the
analyzers and the mappings when creating an index: I can't find any
examples that do both, and in my experimentation the mappings can only
point to analyzers that are already configured. So configuring the
analyzers in elasticsearch.yml is an acceptable work-around for now.
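(For what it's worth, pointing a mapping at a configured analyzer does
work; a sketch with a hypothetical type "person" and field "name":)

$ curl -XPUT 'localhost:9200/sgen/person/_mapping' -d '{
    "person" : {
        "properties" : {
            "name" : { "type" : "string", "analyzer" : "finnish_stemming" }
        }
    }
}'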
But getting the char_filter mapping honored is something I still need to
resolve.
Thank you in advance.