Override built-in analyzer


(sashao) #1

Hello friends,

I'm trying to preserve specific characters during tokenization using the
word_delimiter filter by defining a type_table (as described in
http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html).
My idea is to override the built-in english analyzer by including a
custom-configured "word_delimiter" filter ("type_table": ["# => ALPHA", "@ =>
ALPHA"]), but I cannot find any way to do it.
I also tried to create a custom english analyzer, but I'm running into the
following problems:

  1. I don't know the default settings of the built-in english
    analyzer (and I really want to preserve them).
  2. When I set "tokenizer": "english", I get an error on index creation
    saying that the english tokenizer is not found.

I'm using ES 0.90.5.

Hope for your kind help!
Sasha

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/567ca0cf-d5d1-4e55-9e8c-b7b1833bb47c%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Kurt Hurtado) #2

Hi Sasha,
Would you mind posting the full curl commands or some other representation
of the settings and mappings you're creating?
Thanks!



(sashao) #3

Sure, this is my configuration (using the Sense plugin for Chrome):
POST _template/temp1
{
  "template": "*",
  "order": "5",
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "word_delimiter_filter": {
            "type": "word_delimiter",
            "generate_word_parts": false,
            "catenate_words": true,
            "split_on_numerics": false,
            "preserve_original": true,
            "type_table": [
              "# => ALPHA",
              "@ => ALPHA",
              "% => ALPHA",
              "$ => ALPHA",
              "% => ALPHA"
            ]
          },
          "stop_english": {
            "type": "stop",
            "stopwords": [
              "english"
            ]
          },
          "english_stemmer": {
            "type": "stemmer",
            "name": "english"
          }
        },
        "analyzer": {
          "english": { // HERE I'M TRYING TO OVERRIDE THE BUILT IN english ANALYZER
            "filter": [
              "word_delimiter_filter"
            ]
          },
          "english2": { // HERE I'M TRYING TO CONFIG MY OWN english ANALYZER THAT WOULD BEHAVE LIKE THE BUILT IN
            "type": "custom",
            "tokenizer": "english",
            "filter": [
              "lowercase",
              "word_delimiter_filter",
              "english_stemmer",
              "stop_english"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "default": {
      "dynamic_templates": [
        {
          "template_textEnglish": {
            "match": "text.English.*",
            "mapping": {
              "type": "string",
              "store": "yes",
              "index": "analyzed",
              "analyzer": "english",
              "term_vector": "with_positions_offsets"
            }
          }
        },
        {
          "template_textEnglish": {
            "match": "text.English2.*",
            "mapping": {
              "type": "string",
              "store": "yes",
              "index": "analyzed",
              "analyzer": "english2",
              "term_vector": "with_positions_offsets"
            }
          }
        }
      ]
    }
  }
}

And this is the error I get when trying to create a new index:
{
  "error": "IndexCreationException[[test1] failed to create index]; nested: ElasticSearchIllegalArgumentException[failed to find analyzer type [null] or tokenizer for [english]]; ",
  "status": 400
}
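
A note on reading this error: the name in "tokenizer for [english]" appears to be the name of the analyzer being built, not the name of a tokenizer. The "english" entry under "analyzer" in the template sets neither "type" nor "tokenizer", which would explain "analyzer type [null]". The sketch below is a hypothetical pre-flight check in Python that flags analyzer definitions of that shape before they are sent to the cluster. The function name find_incomplete_analyzers is made up for illustration and is not part of Elasticsearch or any client library, and the rule it encodes ("an analyzer needs a type, or at least a tokenizer") is an assumption inferred from the error message.

```python
def find_incomplete_analyzers(analysis):
    """Return the names of analyzers that declare neither "type" nor "tokenizer".

    Assumption: such definitions are what triggers
    "failed to find analyzer type [null] or tokenizer for [...]".
    """
    analyzers = analysis.get("analyzer", {})
    return sorted(
        name
        for name, definition in analyzers.items()
        if "type" not in definition and "tokenizer" not in definition
    )


# The two analyzer definitions from the template, trimmed to the relevant keys.
analysis = {
    "analyzer": {
        "english": {"filter": ["word_delimiter_filter"]},
        "english2": {
            "type": "custom",
            "tokenizer": "english",
            "filter": [
                "lowercase",
                "word_delimiter_filter",
                "english_stemmer",
                "stop_english",
            ],
        },
    }
}

print(find_incomplete_analyzers(analysis))  # prints ['english']
```

Note that "english2" passes this particular check, but its "tokenizer" value would still have to name a tokenizer that actually exists.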



(David Pilato) #4

There is no english tokenizer.
The built-in english analyzer uses the standard tokenizer.

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr
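
To make that concrete, here is a rough sketch of a custom analyzer that approximates the built-in english analyzer on top of the standard tokenizer. It is written as a Python dict so it can be dumped to JSON and pasted into Sense. The filter chain (possessive stemmer, lowercase, stop, stemmer) follows the way later Elasticsearch documentation describes the english analyzer; whether every option, in particular "possessive_english" and the "language" key of the stemmer filter, is available under those names in 0.90.5 is an assumption to verify. The names "english_possessive", "english_stop", "english_stem" and "rebuilt_english" are made up for illustration.

```python
import json

# Hypothetical filter and analyzer names. Only "standard", "lowercase", "stop",
# "stemmer" and the "_english_" stopword list are meant to refer to built-ins.
rebuilt_english = {
    "analysis": {
        "filter": {
            # Strips trailing 's from tokens (assumed available in this version).
            "english_possessive": {"type": "stemmer", "language": "possessive_english"},
            # "_english_" selects the predefined English stopword list.
            "english_stop": {"type": "stop", "stopwords": "_english_"},
            "english_stem": {"type": "stemmer", "language": "english"},
        },
        "analyzer": {
            "rebuilt_english": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "english_possessive",
                    "lowercase",
                    "english_stop",
                    "english_stem",
                ],
            }
        },
    }
}

print(json.dumps(rebuilt_english, indent=2))
```

A custom word_delimiter filter could then be added to the "filter" list of this analyzer, which is not possible with the built-in one.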



(sashao) #5

David, thanks for your answer!
So, I'll try to sum things up to this point:

  1. My general goal is to preserve specific characters during
    tokenization, so that someone searching for "25$" finds only docs
    containing 25$, and not docs with other occurrences of 25.
  2. To do this, I thought I would customize the "word_delimiter" filter with
    a custom "type_table": ["$ => ALPHA"].
  3. Now I need to include this filter in an analyzer definition, so I thought
    about two options:
    a. Override the built-in "english" analyzer. I'm not sure how to do it, or
    whether it's possible at all, but it would probably be the most convenient
    solution for this specific problem.
    b. Create a custom english analyzer. The problem is that I'm not sure what
    the right list of filters is to preserve the built-in english analyzer's
    behaviour.

So, up to this point and using David's comment, I came up with the
following definition:

"settings": {
  "index": {
    "analysis": {
      "filter": {
        "custom_word_delimiter": {
          "type": "word_delimiter",
          "generate_word_parts": false,
          "catenate_words": true,
          "split_on_numerics": false,
          "preserve_original": true,
          "type_table": ["# => ALPHA"]
        },
        "stop_english": {
          "type": "stop",
          "stopwords": [
            "english"
          ]
        },
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "my_custom_english": {
          "type": "custom",
          "tokenizer": "english",
          "filter": [
            "lowercase",
            "custom_word_delimiter",
            "english_stemmer",
            "stop_english"
          ]
        }
      }
    }
  }
}

So the question is: will this do the job, and is it an optimal solution to
the problem (I need to support several languages)?

Thanks!!!
Sasha



(sashao) #6

David, thanks for your answer!
So, I'll try to sum things up to this point:

  1. My general goal is to preserve specific characters during
    tokenization, so that someone searching for "25$" finds only docs
    containing 25$, and not docs with other occurrences of 25.
  2. To do this, I thought I would customize the "word_delimiter" filter with
    a custom "type_table": ["$ => ALPHA"].
  3. Now I need to include this filter in an analyzer definition, so I thought
    about two options:
    a. Override the built-in "english" analyzer. I'm not sure how to do it, or
    whether it's possible at all, but it would probably be the most convenient
    solution for this specific problem.
    b. Create a custom english analyzer. The problem is that I'm not sure what
    the right list of filters is to preserve the built-in english analyzer's
    behaviour.

So, up to this point and using David's comment, I came up with the
following definition:

"settings": {
  "index": {
    "analysis": {
      "filter": {
        "custom_word_delimiter": {
          "type": "word_delimiter",
          "generate_word_parts": false,
          "catenate_words": true,
          "split_on_numerics": false,
          "preserve_original": true,
          "type_table": ["# => ALPHA"]
        },
        "stop_english": {
          "type": "stop",
          "stopwords": [
            "english"
          ]
        },
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "my_custom_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_word_delimiter",
            "english_stemmer",
            "stop_english"
          ]
        }
      }
    }
  }
}

So the question is: will this do the job, and is it an optimal solution to
the problem (I need to support several languages)?

Thanks!!!
Sasha
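
On the several-languages point, one way to avoid copy-pasting the definition is to generate it per language. The following is a minimal Python sketch, not a tested Elasticsearch recipe: build_analysis and the generated filter and analyzer names are hypothetical, and it assumes each language has a predefined "_<language>_" stopword list and a stemmer of the same name, which should be checked per language on 0.90.5. One caveat worth verifying with the Analyze API: the standard tokenizer is expected to drop symbols such as "$" and "#" before any token filter runs, in which case a type_table behind it would have nothing to preserve. The sketch therefore takes the tokenizer as a parameter and defaults to "whitespace", which is an assumption about what fits this use case rather than a confirmed fix.

```python
import json


def build_analysis(languages, tokenizer="whitespace", type_table=("$ => ALPHA",)):
    """Build an "analysis" settings dict with one custom analyzer per language.

    The word_delimiter options and the filter order are copied from the
    definition in the post; everything else is illustrative.
    """
    filters = {
        "custom_word_delimiter": {
            "type": "word_delimiter",
            "generate_word_parts": False,
            "catenate_words": True,
            "split_on_numerics": False,
            "preserve_original": True,
            "type_table": list(type_table),
        }
    }
    analyzers = {}
    for lang in languages:
        # Assumed: a predefined stopword list named "_<lang>_" exists.
        filters["stop_" + lang] = {"type": "stop", "stopwords": "_" + lang + "_"}
        # Assumed: the stemmer filter knows a stemmer called <lang>.
        filters[lang + "_stemmer"] = {"type": "stemmer", "name": lang}
        analyzers["my_custom_" + lang] = {
            "type": "custom",
            "tokenizer": tokenizer,
            "filter": [
                "lowercase",
                "custom_word_delimiter",
                lang + "_stemmer",
                "stop_" + lang,
            ],
        }
    return {"analysis": {"filter": filters, "analyzer": analyzers}}


settings = build_analysis(["english", "french"])
print(json.dumps(settings, indent=2))
```

Passing tokenizer="standard" reproduces the definition from the post, so the two variants can be compared side by side on a sample such as "25$".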



(system) #7