David, thanks for your answer!
So, I"ll try to sum up the things untill this point:
- My general purpose is to preserve specific characters during
tokenization, so if someone searches for "25$" he will find docs with
"25$" only, and not docs with other occurrences of 25.
- To do it I thought to customize the "word_delimiter" filter with a
custom "type_table": ["$ => ALPHA"].
- Now I need to include this filter in an analyzer definition, so I
thought about two options:
a. Override the built-in "english" analyzer - I'm not sure how to do it,
or whether that's possible at all, but that would probably be the most
convenient solution for this specific problem.
b. Create a custom english analyzer - the problem is, I'm not sure what
the right filter list is to preserve the behaviour of the built-in
english analyzer (see the sketch right after this list).
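For option b, here is a rough sketch of what I understand the built-in
english analyzer to be made of (standard tokenizer, possessive stemmer,
lowercase, English stopwords, English stemmer). The index name my_index
and the analyzer name rebuilt_english are just placeholders, and I'm not
certain every piece (e.g. the possessive_english stemmer) behaves
identically on 0.90.5, so please treat this as an approximation rather
than the exact built-in definition:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_possessive_stemmer": {
          "type": "stemmer",
          "name": "possessive_english"
        },
        "english_stop": {
          "type": "stop",
          "stopwords": ["_english_"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

A custom word_delimiter filter could then be slotted into that filter
chain, presumably right after "lowercase".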
So up to this point, and using David's comment, I came up with the
following definition:
"settings": {
"index": {
"analysis": {
"filter": {
"custom_word_delimiter": {
"type": "word_delimiter",
"generate_word_parts": false,
"catenate_words": true,
"split_on_numerics": false,
"preserve_original": true,
"type_table": [ "# => ALPHA"]
},
"stop_english": {
"type": "stop",
"stopwords": [
"english"
]
},
"english_stemmer": {
"type": "stemmer",
"name": "english"
}
},- "analyzer": {*
-
"my_custom_english": {* -
"type": "custom",* -
"tokenizer": "standard",* -
"filter": [* -
"lowercase",* -
"custom_word_delimiter",* -
"english_stemmer",* -
"stop_english"* -
]* -
}}* } }
So the question is: can this do the job, and is it an optimal solution
for the problem (I need to support several languages)?
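One way to sanity-check this would be the Analyze API, assuming an index
(say test1) has been created with the settings above:

GET /test1/_analyze?analyzer=my_custom_english&text=it costs 25$

If the type_table works as intended, the output should contain a token
like "25$" (plus the original token, because of "preserve_original")
instead of a bare "25" - I haven't verified this on 0.90.5, so it's a
check to run rather than a guaranteed result.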
Thanks!!!
Sasha
On Wednesday, December 4, 2013 12:28:21 PM UTC+2, David Pilato wrote:
english tokenizer does not exist.
english analyzer uses a standard tokenizer.
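In other words, the failing "english2" analyzer from the configuration
quoted below should presumably start working once its non-existent
"english" tokenizer is replaced with "standard" - a minimal, unverified
sketch:

"english2": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "word_delimiter_filter",
        "english_stemmer",
        "stop_english"
    ]
}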
--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr

On 4 December 2013 at 10:59:08, Sasha Ostrikov (alexander...@gmail.com) wrote:

Sure, so this is my configuration (using the Sense plugin for Chrome):
POST _template/temp1
{
"template": "",
"order": "5",
"settings": {
"index": {
"analysis": {
"filter": {
"word_delimiter_filter": {
"type": "word_delimiter",
"generate_word_parts": false,
"catenate_words": true,
"split_on_numerics": false,
"preserve_original": true,
"type_table": [
"# => ALPHA",
"@ => ALPHA",
"% => ALPHA",
"$ => ALPHA",
"% => ALPHA"
]
},
"stop_english": {
"type": "stop",
"stopwords": [
"english"
]
},
"english_stemmer": {
"type": "stemmer",
"name": "english"
}
},
"analyzer": {
"english": { //HERE I'M TRYING TO OVERRIDE THE BUILT IN
english ANALYZER
"filter": [
"word_delimiter_filter"
]
},
"english2": { //HERE I'M TRYING TO CONFIG MY OWN english
ANALYZER THAT WOULD BEHAVE LIKE THE BUILT IN
"type": "custom",
"tokenizer": "english",
"filter": [
"lowercase",
"word_delimiter_filter",
"english_stemmer",
"stop_english"
]
}
}
}
}
},
"mappings": {
"default": {
"dynamic_templates": [
{
"template_textEnglish": {
"match": "text.English.",
"mapping": {
"type": "string",
"store": "yes",
"index": "analyzed",
"analyzer": "english",
"term_vector": "with_positions_offsets"
}
}
},
{
"template_textEnglish": {
"match": "text.English2.*",
"mapping": {
"type": "string",
"store": "yes",
"index": "analyzed",
"analyzer": "english2",
"term_vector": "with_positions_offsets"
}
}
}
]
}
}
}

and this is the error I get trying to create a new index:
{
"error": "IndexCreationException[[test1] failed to create index];
nested: ElasticSearchIllegalArgumentException[failed to find analyzer type
[null] or tokenizer for [english]]; ",
"status": 400
}

On Tuesday, December 3, 2013 10:42:41 PM UTC+2, Kurt Hurtado wrote:
Hi Sasha,
Would you mind posting the full curl commands or some other
representation of the settings and mappings you're creating?
Thanks!

On Tuesday, December 3, 2013 12:05:21 PM UTC-8, Sasha Ostrikov wrote:
Hello friends,
I'm trying to preserve specific characters during tokenization using the
word_delimiter filter by defining the type_table (as described in a blog
post on fullscale.co).
Actually my idea is to override the built-in English analyzer by
including a custom-configured "word_delimiter" filter ("type_table":
["# => ALPHA", "@ => ALPHA"]), but I cannot find any way to do it.
I also tried to create a custom english analyzer but am still running
into the following problems:
- I don't actually know the default settings of the built-in english
analyzer (but I really want to preserve them).
- While trying to set "tokenizer": "english" I get an error on index
creation, saying that the english tokenizer is not found.
I'm using ES 0.90.5. Hope for your kind help!
Sasha