drount
(Sergio Lopez)
May 30, 2016, 11:26am
1
Hello,
I'm setting up a project where I need several normalizations for a given token.
For example, the text:
I Don't go
should be returned by the queries:
I dont go
I don t go
I don't go
Thus I need the token "Don't" to be normalized to "dont" and "don t".
Stemming is not an option, as some tokens are invented words (e.g. C.O.R.E should be normalized to both CORE and C O R E).
Any idea on how to solve this?
Thank you!
jpountz
(Adrien Grand)
May 30, 2016, 12:56pm
2
This probably requires a custom token filter that inserts synonyms.
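For the fixed cases in the question, the built-in synonym token filter can already emit several variants at the same position; a truly generic solution would still need a custom filter, but a minimal sketch along these lines (filter, analyzer, and rule contents are illustrative, not a tested recipe) could look like:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "variant_synonyms": {
          "type": "synonym",
          "synonyms": [
            "don't => don't, dont, don t",
            "c.o.r.e => core, c o r e"
          ]
        }
      },
      "analyzer": {
        "variant_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "variant_synonyms"]
        }
      }
    }
  }
}
```

The lowercase filter runs before the synonym filter so the rules only need lowercase forms; multi-word expansions like "don t" are supported on the right-hand side of a synonym rule.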
drount
(Sergio Lopez)
May 30, 2016, 1:10pm
3
I'm quite surprised I haven't found anything similar. To me, this looks like a problem that lots of systems could potentially have.
pratyusha
(Pratyusha Rasamsetty)
May 30, 2016, 1:22pm
4
You need to use a custom char filter to replace the character ' with whatever you want.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html
Eg:
"char_filter": {
  "my_pattern": {
    "type": "pattern_replace",
    "pattern": "([-()',])",
    "replacement": ""
  },
  "my_mapping": {
    "type": "mapping",
    "mappings_path": "/etc/elasticsearch/replace_words.txt"
  }
}
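For completeness, a char filter only takes effect once it is referenced from an analyzer in the index settings; a minimal sketch wiring up the pattern_replace filter above (the analyzer name is illustrative):

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "([-()',])",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_pattern"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```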
drount
(Sergio Lopez)
May 30, 2016, 1:30pm
5
The problem with that solution is that I lose the original one.
As an example:
with "Don't" I can pattern_replace ' with (empty) to get "Dont".
However, the token "Don't" is then lost, so I cannot also get "Don t".
In addition, I want this to be token-based, and char filtering happens before tokenization.
pratyusha
(Pratyusha Rasamsetty)
May 30, 2016, 1:58pm
6
What if you use the same analyser for your search query as well? The original data will still be there in the source anyway.
If you also want "I don't go" returned for the query "I don t go", then I guess an ngram analyser would also be needed.
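Applying the same analysis at index and search time falls out of the mapping: for a text field, the index analyzer is also used for queries unless a separate search_analyzer is set. A minimal sketch, assuming an analyzer named my_analyzer has been defined in the index settings:

```json
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_analyzer"
      }
    }
  }
}
```

Here the explicit search_analyzer is redundant (it defaults to the index analyzer) but makes the intent visible.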