Custom Token Filter: selectively remove .(dot) at end of word or catenate letter around .(dot)

Jade_Tremblay · December 3, 2014, 2:37pm

Hello,

I have created a custom analyzer (tokenizer: whitespace).
I would like to remove a dot only at the end of a word AND catenate
letters/words when dots sit between letters (e.g. a.b.c => abc).
What is the way to handle this in ES?
I have tried word_delimiter, but it splits words as soon as it hits a dot, and I
don't want this behaviour.

Here is an example:

Sentence: "the quick brown. fox.asd"

With the current analyzer, it gives these tokens (positions in parentheses):

the(1) quick(2) brown.(3) fox.asd(4)

I would like to have:

the(1) quick(2) brown(3) foxasd(4)

"html_exact_analyser": {
  "char_filter": [
    "html_strip"
  ],
  "filter": [
    "lowercase",
    "asciifolding"
  ],
  "tokenizer": "whitespace"
},

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f04e68db-28a0-4870-b623-4ddd97037ee4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jade_Tremblay · December 3, 2014, 3:41pm

I was able to figure out how to do this with a char_filter:

ref: Elasticsearch Platform — Find real-time answers at scale | Elastic

"char_filter": {
  "remove_dot_pattern": {
    "type": "pattern_replace",
    "pattern": "\\.",
    "replacement": ""
  }
}
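
For anyone who wants to try the idea without a cluster, here is a minimal Python sketch of what such a char_filter does to the example sentence. It assumes the dot in the pattern is escaped (a bare "." is a regex wildcard and would match every character), and it assumes the filter is wired into html_exact_analyser ahead of the whitespace tokenizer, which the thread does not show. The `simulate` helper is hypothetical and only approximates the analysis chain: html_strip and asciifolding are not modelled.

```python
import re

# Index settings as a Python dict. The analyzer and char_filter names come from
# the thread; adding "remove_dot_pattern" to the analyzer is an assumption.
settings = {
    "analysis": {
        "char_filter": {
            "remove_dot_pattern": {
                "type": "pattern_replace",
                "pattern": "\\.",  # escaped: matches a literal dot only
                "replacement": "",
            }
        },
        "analyzer": {
            "html_exact_analyser": {
                "char_filter": ["html_strip", "remove_dot_pattern"],
                "tokenizer": "whitespace",
                "filter": ["lowercase", "asciifolding"],
            }
        },
    }
}


def simulate(text):
    """Rough local approximation: dot removal, whitespace split, lowercase."""
    cfg = settings["analysis"]["char_filter"]["remove_dot_pattern"]
    # Char filters run before the tokenizer, so dots are gone before splitting.
    stripped = re.sub(cfg["pattern"], cfg["replacement"], text)
    return [token.lower() for token in stripped.split()]


print(simulate("the quick brown. fox.asd"))
# ['the', 'quick', 'brown', 'foxasd']
```

Note that this pattern removes every dot, not only trailing ones or those between letters: ".net" would become "net" and "3.14" would become "314". If that is too broad for your data, a narrower regex would be needed.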
