One token, multiple normalizations

Hello,

I'm setting up a project where I need several normalizations for a given token.

For example, the text:

I Don't go

should be returned as a result for each of these queries:

I dont go
I don t go
I don't go

Thus I need the token "Don't" to be normalized to "dont" and "don t".

Stemming is not an option, as some tokens are invented words (e.g. C.O.R.E should be normalized to "CORE" and "C O R E").

Any idea on how to solve this?

Thank you!

This probably requires a custom token filter that inserts synonyms.
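
One built-in way to approximate this without writing a plugin is an explicit synonym rule per token, e.g. with the synonym_graph filter. A minimal sketch (the filter and analyzer names are made up, and the standard tokenizer plus lowercase are assumed so that "Don't" reaches the filter as the single token "don't"):

    "analysis": {
      "filter": {
        "apostrophe_variants": {
          "type": "synonym_graph",
          "synonyms": [
            "don't => don't, dont, don t"
          ]
        }
      },
      "analyzer": {
        "variant_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "apostrophe_variants"]
        }
      }
    }

The obvious drawback is that every token needing variants has to be listed by hand, which is why a filter that generates the forms programmatically would scale better. As far as I know, the docs also recommend graph synonym filters on the search analyzer rather than at index time.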

I'm quite surprised I haven't found anything similar. To me, this looks like a problem that lots of systems could potentially have.

You need to use a custom char filter to replace the ' character with whatever you want.

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html

E.g.:

    "char_filter": {
      "my_pattern": {
        "type": "pattern_replace",
        "pattern": "([-()',])",
        "replacement": ""
      },
      "my_mapping": {
        "type": "mapping",
        "mappings_path": "/etc/elasticsearch/replace_words.txt"
      }
    }
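
For completeness, the char filter then has to be referenced from a custom analyzer, roughly like this (the analyzer name is just a placeholder):

    "analysis": {
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "([-()',])",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["my_pattern"],
          "filter": ["lowercase"]
        }
      }
    }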

The problem with that solution is that I lose the original token.

As an example:

with "Don't" I can use pattern_replace to replace ' with an empty string and get "Dont".

However, the token "Don't" is now lost, and thus I cannot get "Don t".

In addition, I want this to be token-based, and char filtering happens before tokenization.
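
I'm wondering whether the built-in word_delimiter filter (or word_delimiter_graph) could get close, since it can split on the apostrophe/periods and also emit a concatenated form. A rough sketch, assuming a whitespace tokenizer so that "Don't" and "C.O.R.E" reach the filter intact (the filter and analyzer names are placeholders):

    "analysis": {
      "filter": {
        "split_and_join": {
          "type": "word_delimiter",
          "catenate_words": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "token_variants": {
          "tokenizer": "whitespace",
          "filter": ["split_and_join", "lowercase"]
        }
      }
    }

If I understand the filter correctly, "Don't" should come out as "Don't", "Don", "t" and "Dont", and "C.O.R.E" as "C.O.R.E", "C", "O", "R", "E" and "CORE", before lowercasing, but I haven't verified the edge cases with the _analyze API yet.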

What if you use the same analyser for your search query as well? The original data will be there in the _source anyway.

If you also want the query "I don t go" to return "I don't go" as a result, then I guess an ngram analyser would be needed as well.
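
A sketch of what an ngram-based analyser could look like (gram sizes here are arbitrary and would need tuning, e.g. single-letter fragments like the "t" in "don t" fall below min_gram 2):

    "analysis": {
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "ngram_analyser": {
          "tokenizer": "my_ngram",
          "filter": ["lowercase"]
        }
      }
    }

Keep in mind that ngrams grow the index and can produce noisy matches, so it depends on how strict the matching needs to be.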