"How to Create Index and Search with Azerbaijani Character Normalization in Elasticsearch?"

Hello Elasticsearch community,

I am working on an Elasticsearch implementation to support searches in Azerbaijani, where users can input transliterated versions of words (e.g., "sixmemmedov") and still match documents containing Azerbaijani-specific characters (e.g., "şıxməmmədov"). I’m trying to normalize these characters using a custom analyzer and filters.

Current Index Configuration

Here is the latest version of my index_body:

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Token filter: folds non-ASCII Latin characters to their
                # ASCII equivalents (e.g., "ş" -> "s", "ı" -> "i") and keeps
                # the original token alongside the folded one.
                "asciifolding_filter": {
                    "type": "asciifolding",
                    "preserve_original": True
                },
                # Despite the name, this is a token filter, not a char_filter.
                # It collapses the "sh" digraph so that "shixmemmedov" and
                # "sixmemmedov" normalize to the same token.
                "custom_character_filter": {
                    "type": "pattern_replace",
                    "pattern": "sh",
                    "replacement": "s"
                }
            },
            "analyzer": {
                # Used at both index and search time on title and content.
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding_filter",
                        "custom_character_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "custom_analyzer"
            },
            "content": {
                "type": "text",
                "analyzer": "custom_analyzer"
            }
        }
    }
}

What I've Tried

I created this index using the Elasticsearch Python client.
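
For reference, the create call is just the following (a minimal sketch; es is an AsyncElasticsearch instance and index_name is a placeholder for my real index name):

# Create the index with the settings and mappings defined above.
await es.indices.create(index=index_name, body=index_body)

For searching, I used the following code: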

query_body = {
    "size": 10,
    "query": {
        "multi_match": {
            "query": "sixmemmedov",
            "fields": ["title", "content"],
            # "AUTO" fuzziness optionally takes two values ("AUTO:low,high");
            # "AUTO:5" alone is rejected by Elasticsearch, so I switched to
            # the explicit two-value form.
            "fuzziness": "AUTO:3,6",
            "max_expansions": 100
        }
    }
}

response = await es.search(index=index_name, body=query_body)
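
To see what the analyzer actually emits, I assume something like the _analyze call below would show the tokens (a sketch, not verified output):

# Inspect the tokens custom_analyzer produces for an Azerbaijani surname.
result = await es.indices.analyze(
    index=index_name,
    body={"analyzer": "custom_analyzer", "text": "şıxməmmədov"},
)
print([t["token"] for t in result["tokens"]])
# I expect something like ['sixmemmedov'], but I am not certain that
# asciifolding maps "ə" to "e" rather than some other ASCII letter,
# which is part of what I am asking below.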

What I Need Help With

  1. Index Creation: Is my current index setup the best way to handle Azerbaijani characters and transliterations (e.g., "ş" to "s", "ə" to "e")? How can I improve this configuration to cover more complex transliterations and character mappings?
  2. Search: How can I make my search more flexible to match Azerbaijani characters with their transliterated forms? Should I use different filters or analyzers for the query to achieve more consistent results?
  3. Pattern Replacement: Is using a pattern_replace token filter the correct approach for handling character mapping, or is there a more efficient way to set up this transliteration? (One alternative I have been considering is sketched below.)
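
For question 3, here is the alternative I have been sketching: an explicit mapping char_filter that rewrites each Azerbaijani letter to its usual transliteration before tokenization, instead of relying on asciifolding's choices. This is untested, and the filter name is a placeholder:

alternative_index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "azerbaijani_translit": {
                    "type": "mapping",
                    "mappings": [
                        # char_filters run before the `lowercase` token
                        # filter, so uppercase letters need their own entries.
                        "ə => e", "Ə => E",
                        "ş => s", "Ş => S",
                        "ç => c", "Ç => C",
                        "ğ => g", "Ğ => G",
                        "ı => i", "İ => I",
                        "ö => o", "Ö => O",
                        "ü => u", "Ü => U"
                    ]
                }
            },
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["azerbaijani_translit"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "custom_analyzer"},
            "content": {"type": "text", "analyzer": "custom_analyzer"}
        }
    }
}

Would this be a better starting point than the pattern_replace token filter, and could the "sh" => "s" digraph rule live in the same char_filter?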

Any advice, best practices, or sample configurations on creating an index and performing searches to handle these character mappings would be greatly appreciated. Thank you in advance for your assistance!