David, thanks for your answer!
So, I"ll try to sum up the things untill this point:
- My general purpose is to preserve specific characters during
tokenization, so if someone searches for "25$" he will find docs with
"25$" only, and not docs with other occurrences of 25.
- To do it I thought to customize the "word_delimiter" filter with a
custom "type_table": ["$ => ALPHA"].
- Now I need to include this filter in an analyzer definition, so I
thought about two options:
a. Override the built-in "english" analyzer - I'm not sure how to do it,
or whether that's possible at all, but that would probably be the most
convenient solution for this specific problem.
b. Create a custom english analyzer - the problem is, I'm not sure what
the right filter list is to preserve the behaviour of the built-in
english analyzer (see the sketch right after this list).
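For option b, here is a rough sketch of what I understand the built-in
english analyzer to be made of (standard tokenizer, possessive stemmer,
lowercase, English stopwords, English stemmer). The index name my_index
and the analyzer name rebuilt_english are just placeholders, and I'm not
certain every piece (e.g. the possessive_english stemmer) behaves
identically on 0.90.5, so please treat this as an approximation rather
than the exact built-in definition:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_possessive_stemmer": {
          "type": "stemmer",
          "name": "possessive_english"
        },
        "english_stop": {
          "type": "stop",
          "stopwords": ["_english_"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

A custom word_delimiter filter could then be slotted into that filter
chain, presumably right after "lowercase".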
So up to this point, and using David's comment, I came up with the
following definition:
"settings": {
"index": {
"analysis": {
"filter": {
"custom_word_delimiter": {
"type": "word_delimiter",
"generate_word_parts": false,
"catenate_words": true,
"split_on_numerics": false,
"preserve_original": true,
"type_table": [ "# => ALPHA"]
},
"stop_english": {
"type": "stop",
"stopwords": [
"english"
]
},
"english_stemmer": {
"type": "stemmer",
"name": "english"
}
},- "analyzer": {*
-
"my_custom_english": {* -
"type": "custom",* -
"tokenizer": "standard",* -
"filter": [* -
"lowercase",* -
"custom_word_delimiter",* -
"english_stemmer",* -
"stop_english"* -
]* -
}}* } }
So the question is: can this do the job, and is it an optimal solution
for the problem (I need to support several languages)?
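One way to sanity-check this would be the Analyze API, assuming an index
(say test1) has been created with the settings above:

GET /test1/_analyze?analyzer=my_custom_english&text=it costs 25$

If the type_table works as intended, the output should contain a token
like "25$" (plus the original token, because of "preserve_original")
instead of a bare "25" - I haven't verified this on 0.90.5, so it's a
check to run rather than a guaranteed result.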
Thanks!!!
Sasha
On Wednesday, December 4, 2013 12:28:21 PM UTC+2, David Pilato wrote:
english tokenizer does not exist.
english analyzer uses a standard tokenizer.
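In other words, the failing "english2" analyzer from the configuration
quoted below should presumably start working once its non-existent
"english" tokenizer is replaced with "standard" - a minimal, unverified
sketch:

"english2": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "word_delimiter_filter",
        "english_stemmer",
        "stop_english"
    ]
}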
--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr

On 4 December 2013 at 10:59:08, Sasha Ostrikov (alexander...@gmail.com) wrote:

Sure, so this is my configuration (using the Sense plugin for Chrome):
POST _template/temp1
{
"template": "",
"order": "5",
"settings": {
"index": {
"analysis": {
"filter": {
"word_delimiter_filter": {
"type": "word_delimiter",
"generate_word_parts": false,
"catenate_words": true,
"split_on_numerics": false,
"preserve_original": true,
"type_table": [
"# => ALPHA",
"@ => ALPHA",
"% => ALPHA",
"$ => ALPHA",
"% => ALPHA"
]
},
"stop_english": {
"type": "stop",
"stopwords": [
"english"
]
},
"english_stemmer": {
"type": "stemmer",
"name": "english"
}
},
"analyzer": {
"english": { //HERE I'M TRYING TO OVERRIDE THE BUILT IN
english ANALYZER
"filter": [
"word_delimiter_filter"
]
},
"english2": { //HERE I'M TRYING TO CONFIG MY OWN english
ANALYZER THAT WOULD BEHAVE LIKE THE BUILT IN
"type": "custom",
"tokenizer": "english",
"filter": [
"lowercase",
"word_delimiter_filter",
"english_stemmer",
"stop_english"
]
}
}
}
}
},
"mappings": {
"default": {
"dynamic_templates": [
{
"template_textEnglish": {
"match": "text.English.",
"mapping": {
"type": "string",
"store": "yes",
"index": "analyzed",
"analyzer": "english",
"term_vector": "with_positions_offsets"
}
}
},
{
"template_textEnglish": {
"match": "text.English2.*",
"mapping": {
"type": "string",
"store": "yes",
"index": "analyzed",
"analyzer": "english2",
"term_vector": "with_positions_offsets"
}
}
}
]
}
}
}

and this is the error I get trying to create a new index:
{
"error": "IndexCreationException[[test1] failed to create index];
nested: ElasticSearchIllegalArgumentException[failed to find analyzer type
[null] or tokenizer for [english]]; ",
"status": 400
}

On Tuesday, December 3, 2013 10:42:41 PM UTC+2, Kurt Hurtado wrote:
Hi Sasha,
Would you mind posting the full curl commands or some other
representation of the settings and mappings you're creating?
Thanks!

On Tuesday, December 3, 2013 12:05:21 PM UTC-8, Sasha Ostrikov wrote:
Hello friends,
I'm trying to preserve specific characters during tokenization using the
word_delimiter filter by defining the type_table (as described in a blog
post on fullscale.co).
Actually my idea is to override the built-in English analyzer by
including a custom-configured "word_delimiter" filter ("type_table":
["# => ALPHA", "@ => ALPHA"]), but I cannot find any way to do it.
I also tried to create a custom english analyzer but am still running
into the following problems:
- I don't actually know the default settings of the built-in english
analyzer (but I really want to preserve them).
- While trying to set "tokenizer": "english" I get an error on index
creation, saying that the english tokenizer is not found.
I'm using ES 0.90.5. Hope for your kind help!
Sasha