Override built-in analyzer

sashao · December 3, 2013, 8:05pm

Hello friends,

I'm trying to preserve specific characters during tokenization by using the
word_delimiter filter with a custom type_table (as described in
http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html).
My idea is to override the built-in english analyzer so that it includes a
custom-configured "word_delimiter" filter ("type_table": ["# => ALPHA",
"@ => ALPHA"]), but I cannot find a way to do it.
I also tried to create a custom English analyzer, but I ran into the
following problems:

1. I don't know the default settings of the built-in english analyzer
   (and I really want to preserve them).
2. When I set "tokenizer": "english", index creation fails with an error
   saying that the english tokenizer cannot be found.

I'm using ES 0.90.5.

Hope for your kind help!
Sasha

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/567ca0cf-d5d1-4e55-9e8c-b7b1833bb47c%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Kurt_Hurtado · December 3, 2013, 8:42pm

Hi Sasha,
Would you mind posting the full curl commands or some other representation
of the settings and mappings you're creating?
Thanks!


sashao · December 4, 2013, 9:59am

Sure, here is my configuration (using the Sense plugin for Chrome):
POST _template/temp1
{
  "template": "*",
  "order": "5",
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "word_delimiter_filter": {
            "type": "word_delimiter",
            "generate_word_parts": false,
            "catenate_words": true,
            "split_on_numerics": false,
            "preserve_original": true,
            "type_table": [
              "# => ALPHA",
              "@ => ALPHA",
              "% => ALPHA",
              "$ => ALPHA",
              "% => ALPHA"
            ]
          },
          "stop_english": {
            "type": "stop",
            "stopwords": [
              "_english_"
            ]
          },
          "english_stemmer": {
            "type": "stemmer",
            "name": "english"
          }
        },
        "analyzer": {
          "english": { // here I'm trying to override the built-in english analyzer
            "filter": [
              "word_delimiter_filter"
            ]
          },
          "english2": { // here I'm trying to configure my own english analyzer that would behave like the built-in one
            "type": "custom",
            "tokenizer": "english",
            "filter": [
              "lowercase",
              "word_delimiter_filter",
              "english_stemmer",
              "stop_english"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "template_textEnglish": {
            "match": "text.English.*",
            "mapping": {
              "type": "string",
              "store": "yes",
              "index": "analyzed",
              "analyzer": "english",
              "term_vector": "with_positions_offsets"
            }
          }
        },
        {
          "template_textEnglish": {
            "match": "text.English2.*",
            "mapping": {
              "type": "string",
              "store": "yes",
              "index": "analyzed",
              "analyzer": "english2",
              "term_vector": "with_positions_offsets"
            }
          }
        }
      ]
    }
  }
}

And this is the error I get when I try to create a new index:
{
  "error": "IndexCreationException[[test1] failed to create index]; nested: ElasticSearchIllegalArgumentException[failed to find analyzer type [null] or tokenizer for [english]]; ",
  "status": 400
}
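
A note on reading this message: it reports analyzer type [null] for [english], while english2 declares "type": "custom". That suggests the failure comes from the analyzer entry named english, which has only a "filter" list and neither a "type" nor a "tokenizer". A minimal sketch of that entry with both set, assuming the standard tokenizer is the right base and reusing the filter name from the template above (untested on 0.90.5):

```json
{
  "analyzer": {
    "english": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "word_delimiter_filter"
      ]
    }
  }
}
```

An entry defined this way would presumably replace the built-in chain rather than extend it, so lowercasing, stop-word removal and stemming would have to be listed in "filter" explicitly.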


dadoonet · December 4, 2013, 10:28am

There is no english tokenizer.
The english analyzer uses the standard tokenizer.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr
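
To make the reply above concrete: a custom analyzer refers to its tokenizer by name, and "standard" is a registered tokenizer whereas "english" exists only as an analyzer. The sketch below approximates an English chain on top of the standard tokenizer. The analyzer name english_like is hypothetical, the filter definitions are adapted from the template earlier in the thread, and the filter order (stop words before stemming) follows Lucene's EnglishAnalyzer as far as I know. It is not guaranteed to match the built-in analyzer exactly; in particular, the stemmer may differ and possessive handling is left out.

```json
{
  "analysis": {
    "filter": {
      "stop_english": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "english_stemmer": {
        "type": "stemmer",
        "name": "english"
      }
    },
    "analyzer": {
      "english_like": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "stop_english",
          "english_stemmer"
        ]
      }
    }
  }
}
```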


sashao · December 4, 2013, 12:43pm

David, thanks for your answer!
So, I'll try to sum things up so far:

My general goal is to preserve specific characters during tokenization, so
that if someone searches for "25$" they will find only docs containing 25$
and not docs with other occurrences of 25.
To do that, I thought I would customize the "word_delimiter" filter with a
custom "type_table": ["$ => ALPHA"].
Now I need to include this filter in an analyzer definition, so I thought
about two options:
a. Override the built-in "english" analyzer. I'm not sure how to do it, or
whether it is possible at all, but it would probably be the most convenient
solution for this specific problem.
b. Create a custom English analyzer. The problem is that I'm not sure what
the right filter list is to preserve the built-in english tokenization
behaviour.
So, up to this point and using David's comment, I came up with the
following definition:
"settings": {
  "index": {
    "analysis": {
      "filter": {
        "custom_word_delimiter": {
          "type": "word_delimiter",
          "generate_word_parts": false,
          "catenate_words": true,
          "split_on_numerics": false,
          "preserve_original": true,
          "type_table": ["# => ALPHA"]
        },
        "stop_english": {
          "type": "stop",
          "stopwords": [
            "_english_"
          ]
        },
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "my_custom_english": {
          "type": "custom",
          "tokenizer": "english",
          "filter": [
            "lowercase",
            "custom_word_delimiter",
            "english_stemmer",
            "stop_english"
          ]
        }
      }
    }
  }
}

So the question is: can this do the job, and is it an optimal solution for
the problem (I need to support several languages)?

Thanks!!!
Sasha


sashao · December 4, 2013, 12:46pm

David, thanks for your answer!
So, I'll try to sum things up so far:

My general goal is to preserve specific characters during tokenization, so
that if someone searches for "25$" they will find only docs containing 25$
and not docs with other occurrences of 25.
To do that, I thought I would customize the "word_delimiter" filter with a
custom "type_table": ["$ => ALPHA"].
Now I need to include this filter in an analyzer definition, so I thought
about two options:
a. Override the built-in "english" analyzer. I'm not sure how to do it, or
whether it is possible at all, but it would probably be the most convenient
solution for this specific problem.
b. Create a custom English analyzer. The problem is that I'm not sure what
the right filter list is to preserve the built-in english tokenization
behaviour.
So, up to this point and using David's comment, I came up with the
following definition:
"settings": {
  "index": {
    "analysis": {
      "filter": {
        "custom_word_delimiter": {
          "type": "word_delimiter",
          "generate_word_parts": false,
          "catenate_words": true,
          "split_on_numerics": false,
          "preserve_original": true,
          "type_table": ["# => ALPHA"]
        },
        "stop_english": {
          "type": "stop",
          "stopwords": [
            "_english_"
          ]
        },
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "my_custom_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_word_delimiter",
            "english_stemmer",
            "stop_english"
          ]
        }
      }
    }
  }
}

So the question is: can this do the job, and is it an optimal solution for
the problem (I need to support several languages)?

Thanks!!!
Sasha
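
One caveat about the definition above, offered as a hedged observation rather than a tested answer: as far as I know, the standard tokenizer discards symbols such as $ and # while splitting text, so by the time word_delimiter runs there is no $ left for type_table to reclassify. A variant that tokenizes on whitespace keeps the symbol attached to the token. The sketch below assumes the whitespace tokenizer is acceptable for the data and uses "$ => ALPHA" to match the stated goal; the analyzer name my_custom_english_ws is hypothetical.

```json
{
  "analysis": {
    "filter": {
      "custom_word_delimiter": {
        "type": "word_delimiter",
        "generate_word_parts": false,
        "catenate_words": true,
        "split_on_numerics": false,
        "preserve_original": true,
        "type_table": ["$ => ALPHA"]
      }
    },
    "analyzer": {
      "my_custom_english_ws": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": [
          "custom_word_delimiter",
          "lowercase"
        ]
      }
    }
  }
}
```

The stop and stemmer filters from the definition above can be appended to the filter list. Running a sample such as 25$ through the _analyze API against a test index is a quick way to check which tokens each variant actually produces.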

