Issues with search results containing special characters

I'm working on an Elasticsearch application for searching person data, including names in both English and French. I've completed the data indexing process.

Current Issues:

  1. Case Sensitivity: Searching for "Francois" doesn't match documents containing "françois" or "François."
  2. Special Characters: Names with accents (e.g., François) don't match searches without them (e.g., francois).
  3. Stripping Accents: My current approach of stripping accented (non-ASCII) characters at search time hinders accurate matching for French names.

Desired Outcome:

I want to achieve case-insensitive and special character-insensitive matching for French names in my Elasticsearch search. This means:

  • Searching for "francois" should match documents containing "Francois," "françois," and "François."
  • Accents and other relevant special characters in French names should not affect search results.

Is there a way to achieve this?

Welcome.

Have a look at Language analyzers | Elasticsearch Guide [8.14] | Elastic

That will help you to build this.
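
For example, a minimal sketch of an index that applies the built-in french analyzer to a name field (the index name people and the field name name are placeholders):

```
# "people" and "name" are placeholder names
PUT /people
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "french"
      }
    }
  }
}
```

Text indexed into that field then goes through the French elision, lowercasing, stop-word, and stemming steps.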

Hi @dadoonet
I've tried analyzers. The following is the analyzer I am using:

**Filters**

```
"filter": {
  "french_elision": {
    "type": "elision",
    "articles": [
      "l", "m", "t", "qu", "n", "s", "j",
      "d", "c", "jusqu", "quoiqu", "lorsqu", "puisqu"
    ],
    "articles_case": true
  },
  "french_stop": {
    "type": "stop",
    "stopwords": "_french_"
  },
  "french_stemmer": {
    "type": "stemmer",
    "language": "french"
  }
}
```

**Analyzer**

```
"analyzer": {
  "french": {
    "tokenizer": "standard",
    "filter": [
      "french_elision",
      "lowercase",
      "french_stop",
      "french_stemmer"
    ]
  }
}
```

I will share the result from the _analyze API as well.
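
For reference, the request looks something like this (the index name people is a placeholder, with the analyzer above defined in its settings):

```
# "people" is a placeholder index name
POST /people/_analyze
{
  "analyzer": "french",
  "text": "François"
}
```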

This doesn't cover my expected results:
If the input is "francois", I want to match records containing "françois" as an exact match (right now it only works with fuzziness).
Is there anything I am missing in my analyzers?

You need to add an asciifolding token filter as well.

And then test what happens when you analyze françois with the french analyzer.
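
As a sketch, your analyzer's filter chain with asciifolding added could look like this (folding is placed right after lowercase here; placing it before or after the stemmer changes how accented words are stemmed, so test what suits your data):

```
"french": {
  "tokenizer": "standard",
  "filter": [
    "french_elision",
    "lowercase",
    "asciifolding",
    "french_stop",
    "french_stemmer"
  ]
}
```

Re-running the _analyze request above with both "françois" and "francois" should then produce the same tokens.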
