One token, multiple normalizations

Hello,

I'm setting up a project where I need several normalizations for a given token.

For example, the text:

I Don't go

should be returned as a result for each of these queries:

I dont go
I don t go
I don't go

Thus I need the token "Don't" to be normalized to "dont" and "don t".

Stemming is not an option, as some tokens are invented words (e.g. C.O.R.E should be normalized to "CORE" and "C O R E").

Any idea on how to solve this?

Thank you!

This probably requires a custom token filter that inserts synonyms.
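
One built-in way to approximate this without writing a plugin is an explicit synonym rule per token, e.g. with the synonym_graph filter. A minimal sketch (the filter and analyzer names are made up, and the standard tokenizer plus lowercase are assumed so that "Don't" reaches the filter as the single token "don't"):

    "analysis": {
      "filter": {
        "apostrophe_variants": {
          "type": "synonym_graph",
          "synonyms": [
            "don't => don't, dont, don t"
          ]
        }
      },
      "analyzer": {
        "variant_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "apostrophe_variants"]
        }
      }
    }

The obvious drawback is that every token needing variants has to be listed by hand, which is why a filter that generates the forms programmatically would scale better. As far as I know, the docs also recommend graph synonym filters on the search analyzer rather than at index time.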

I'm quite surprised I haven't found anything similar. To me, this looks like a problem that lots of systems could potentially have.

You need to use a custom char filter to replace the ' character with whatever you want.

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html

E.g.:

    "char_filter": {
      "my_pattern": {
        "type": "pattern_replace",
        "pattern": "([-()',])",
        "replacement": ""
      },
      "my_mapping": {
        "type": "mapping",
        "mappings_path": "/etc/elasticsearch/replace_words.txt"
      }
    }
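
For completeness, the char filter then has to be referenced from a custom analyzer, roughly like this (the analyzer name is just a placeholder):

    "analysis": {
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "([-()',])",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["my_pattern"],
          "filter": ["lowercase"]
        }
      }
    }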

The problem with that solution is that I lose the original token.

As an example:

with "Don't" I can use pattern_replace to replace ' with an empty string and get "Dont".

However, the token "Don't" is now lost, and thus I cannot get "Don t".

In addition, I want this to be token-based, and char filtering happens before tokenization.
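
I'm wondering whether the built-in word_delimiter filter (or word_delimiter_graph) could get close, since it can split on the apostrophe/periods and also emit a concatenated form. A rough sketch, assuming a whitespace tokenizer so that "Don't" and "C.O.R.E" reach the filter intact (the filter and analyzer names are placeholders):

    "analysis": {
      "filter": {
        "split_and_join": {
          "type": "word_delimiter",
          "catenate_words": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "token_variants": {
          "tokenizer": "whitespace",
          "filter": ["split_and_join", "lowercase"]
        }
      }
    }

If I understand the filter correctly, "Don't" should come out as "Don't", "Don", "t" and "Dont", and "C.O.R.E" as "C.O.R.E", "C", "O", "R", "E" and "CORE", before lowercasing, but I haven't verified the edge cases with the _analyze API yet.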

What if you use the same analyser for your search query as well? The original data will be there in the _source anyway.

If you also want the query "I don t go" to return "I don't go" as a result, then I guess an ngram analyser would be needed as well.
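
A sketch of what an ngram-based analyser could look like (gram sizes here are arbitrary and would need tuning, e.g. single-letter fragments like the "t" in "don t" fall below min_gram 2):

    "analysis": {
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "ngram_analyser": {
          "tokenizer": "my_ngram",
          "filter": ["lowercase"]
        }
      }
    }

Keep in mind that ngrams grow the index and can produce noisy matches, so it depends on how strict the matching needs to be.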