Override built-in analyzer


(sashao) #1

Hello friends,

I'm trying to preserve specific characters during tokenization by using the word_delimiter filter with a type_table (as described in http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html).
My idea is to override the built-in english analyzer so that it includes a custom-configured "word_delimiter" filter ("type_table": ["# => ALPHA", "@ => ALPHA"]), but I cannot find a way to do it.
I also tried to create a custom English analyzer, but I ran into the following problems:

  1. I don't know the default settings of the built-in english analyzer (and I really want to preserve its behavior).
  2. When I set "tokenizer": "english", index creation fails with an error saying that the english tokenizer cannot be found.

I'm using ES 0.90.5.

I'd appreciate any help!
Sasha


(sashao) #2

This is my configuration (using the Sense plugin for Chrome):
POST _template/temp1
{
"template": "",
"order": "5",
"settings": {
"index": {
"analysis": {
"filter": {
"word_delimiter_filter": {
"type": "word_delimiter",
"generate_word_parts": false,
"catenate_words": true,
"split_on_numerics": false,
"preserve_original": true,
"type_table": [
"# => ALPHA",
"@ => ALPHA",
"% => ALPHA",
"$ => ALPHA",
"% => ALPHA"
]
},
"stop_english": {
"type": "stop",
"stopwords": [
"english"
]
},
"english_stemmer": {
"type": "stemmer",
"name": "english"
}
},
"analyzer": {
"english": { //HERE I'M TRYING TO OVERRIDE THE BUILT IN english ANALYZER
"filter": [
"word_delimiter_filter"
]
},
"english2": { //HERE I'M TRYING TO CONFIG MY OWN english ANALYZER THAT WOULD BEHAVE LIKE THE BUILT IN
"type": "custom",
"tokenizer": "english",
"filter": [
"lowercase",
"word_delimiter_filter",
"english_stemmer",
"stop_english"
]
}
}
}
}
},
"mappings": {
"default": {
"dynamic_templates": [
{
"template_textEnglish": {
"match": "text.English.*",
"mapping": {
"type": "string",
"store": "yes",
"index": "analyzed",
"analyzer": "english",
"term_vector": "with_positions_offsets"
}
}
},
{
"template_textEnglish": {
"match": "text.English2.*",
"mapping": {
"type": "string",
"store": "yes",
"index": "analyzed",
"analyzer": "english2",
"term_vector": "with_positions_offsets"
}
}
}
]
}
}
}

And this is the error I get when trying to create a new index:
{
"error": "IndexCreationException[[test1] failed to create index]; nested: ElasticSearchIllegalArgumentException[failed to find analyzer type [null] or tokenizer for [english]]; ",
"status": 400
}
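
A note on the error, offered as a reading of the message rather than a confirmed diagnosis: "english" is the name of a built-in analyzer, not of a tokenizer, so "tokenizer": "english" cannot resolve. The message itself (analyzer type [null] or tokenizer for [english]) appears to point at the "english" analyzer entry, which declares neither a "type" nor a "tokenizer". The built-in english analyzer is based on the standard tokenizer, which drops characters such as # and @ before any token filter runs, so a type_table would have nothing left to preserve there. The sketch below therefore assumes the whitespace tokenizer instead, which is a deliberate departure from the built-in behavior. The filter chain (lowercase, English stop words, English stemmer) only approximates the built-in analyzer, and the "_english_" stop word list name is an assumption that should be checked against 0.90.5.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "word_delimiter_filter": {
            "type": "word_delimiter",
            "preserve_original": true,
            "type_table": ["# => ALPHA", "@ => ALPHA"]
          },
          "stop_english": {
            "type": "stop",
            "stopwords": "_english_"
          },
          "english_stemmer": {
            "type": "stemmer",
            "name": "english"
          }
        },
        "analyzer": {
          "english2": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "word_delimiter_filter",
              "lowercase",
              "stop_english",
              "english_stemmer"
            ]
          }
        }
      }
    }
  }
}
```

Running a sample string containing # and @ through the index's _analyze endpoint with the english2 analyzer should show whether those characters survive. Because the whitespace tokenizer leaves surrounding punctuation attached to tokens, the output may differ from the built-in analyzer in other ways too.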

