In summary: everything is currently working except the char_filter mapping.
I'm on ElasticSearch 0.19.10 because it works fine for our current
production application, and our usage doesn't require any of the bug fixes
in the change logs for more recent versions.
I've isolated this issue to the configured analyzers and a small collection
of HTTP _analyze requests that easily reproduce the problem. No additional
data or queries should be needed at this point (I believe, anyway).
Here is the example I found at
http://www.elasticsearch.org/guide/reference/index-modules/analysis/mapping-charfilter.html
{
    "index" : {
        "analysis" : {
            "char_filter" : {
                "my_mapping" : {
                    "type" : "mapping",
                    "mappings" : ["ph=>f", "qu=>q"]
                }
            },
            "analyzer" : {
                "custom_with_char_filter" : {
                    "tokenizer" : "standard",
                    "char_filter" : ["my_mapping"]
                }
            }
        }
    }
}
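(For context, I'd expect to apply a settings block like that at
index-creation time; here's a sketch using a hypothetical index name
"test". I haven't verified this exact request:)

$ curl -XPUT 'localhost:9200/test' -d '{
    "settings" : {
        "index" : {
            "analysis" : {
                "char_filter" : {
                    "my_mapping" : {
                        "type" : "mapping",
                        "mappings" : ["ph=>f", "qu=>q"]
                    }
                },
                "analyzer" : {
                    "custom_with_char_filter" : {
                        "tokenizer" : "standard",
                        "char_filter" : ["my_mapping"]
                    }
                }
            }
        }
    }
}'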
Here is what my elasticsearch.yml configuration looks like; note the
Finnish character mappings that are typical when searching Finnish names.
The example above didn't quite cover what I need: a snowball stemming
analyzer with Finnish stemming rules, no stop words, and converting w to v
in the input string before tokenizing. After playing around a little,
here's what works (except for the char_filter):
index:
  analysis:
    char_filter:
      finnish_char_mapping:
        type: mapping
        mappings: [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
    analyzer:
      # Default uses the snowball stemming analyzer with no stop words,
      # with the default language per the JVM:
      default:
        type: snowball
        stopwords: none
      # Per-language analyzers
      english_standard:
        type: standard
        language: English
        stopwords: none
      english_stemming:
        type: snowball
        language: English
        stopwords: none
      finnish_stemming:
        type: snowball
        language: Finnish
        char_filter: [finnish_char_mapping]
        stopwords: none
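(As a quick sanity check that the yml configuration was picked up, hitting
one of the configured analyzers by name should return tokens rather than an
error; "sgen" is the index used in the requests below:)

$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=finnish_stemming&pretty=true' -d 'testi' && echo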
This first analyze operation returns the expected tokens. It analyzes the
text using the built-in standard analyzer, which removes stop words by
default (note that "and" is missing and position 3 is skipped):
$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=standard&pretty=true' -d 'Debby Debbie and Walter' && echo
{
  "tokens" : [ {
    "token" : "debby",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "debbie",
    "start_offset" : 6,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "walter",
    "start_offset" : 17,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
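(For completeness, the configured english_standard analyzer, which has
stopwords: none, can be exercised the same way; I've omitted its output
here:)

$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=english_standard&pretty=true' -d 'Debby Debbie and Walter' && echo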
This also works: it uses the snowball analyzer with the English language,
and stop words are included in the list of tokens as desired:
$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=english_stemming&pretty=true' -d 'Debby Debbie and Walter' && echo
{
  "tokens" : [ {
    "token" : "debbi",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "debbi",
    "start_offset" : 6,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "and",
    "start_offset" : 13,
    "end_offset" : 16,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "walter",
    "start_offset" : 17,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
But this doesn't fully work. It applies the Finnish stemming rules (as far
as I can tell; the tokens differ from those produced by the English
snowball rules), but it does not honor the character mapping: I would have
expected "valter", not "walter", as the last token string. And of course a
search for valter won't match walter, so this analysis token issue is
likely the root cause:
$ curl -XGET 'localhost:9200/sgen/_analyze?analyzer=finnish_stemming&pretty=true' -d 'Debby Debbie and Walter' && echo
{
  "tokens" : [ {
    "token" : "deby",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "debie",
    "start_offset" : 6,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "and",
    "start_offset" : 13,
    "end_offset" : 16,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "walter",
    "start_offset" : 17,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
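(One experiment I'm considering, in case the pre-built snowball analyzer
type simply ignores char_filter: rebuild the analyzer as a custom one from
the standard tokenizer, the lowercase token filter, a snowball token
filter, and the same char filter. This is an untested sketch; the names
finnish_stemming_custom and finnish_snowball are mine:)

index:
  analysis:
    char_filter:
      finnish_char_mapping:
        type: mapping
        mappings: [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
    filter:
      finnish_snowball:
        type: snowball
        language: Finnish
    analyzer:
      finnish_stemming_custom:
        type: custom
        tokenizer: standard
        char_filter: [finnish_char_mapping]
        filter: [lowercase, finnish_snowball]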
Separately, I haven't been able to get ElasticSearch to define both the
analyzers and the mappings when creating an index: I can't find any
examples that do both, and in my experimentation the mappings can only
point to analyzers that are already configured. So configuring the
analyzers in elasticsearch.yml is an acceptable work-around for now.
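(For what it's worth, pointing a mapping at a configured analyzer does
work; a sketch with a hypothetical type "person" and field "name":)

$ curl -XPUT 'localhost:9200/sgen/person/_mapping' -d '{
    "person" : {
        "properties" : {
            "name" : { "type" : "string", "analyzer" : "finnish_stemming" }
        }
    }
}'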
But getting the char_filter mapping honored is something I still need to
resolve.
Thank you in advance.