"How to Create Index and Search with Azerbaijani Character Normalization in Elasticsearch?"

Hello Elasticsearch community,

I am working on an Elasticsearch implementation to support searches in Azerbaijani, where users can input transliterated versions of words (e.g., "sixmemmedov") and still match documents containing Azerbaijani-specific characters (e.g., "şıxməmmədov"). I’m trying to normalize these characters using a custom analyzer and filters.

Current Index Configuration

Here is the latest version of my index_body:

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Token filter: folds non-ASCII Latin characters to their
                # ASCII equivalents (e.g., "ş" -> "s", "ı" -> "i") and keeps
                # the original token alongside the folded one.
                "asciifolding_filter": {
                    "type": "asciifolding",
                    "preserve_original": True
                },
                # Despite the name, this is a token filter, not a char_filter.
                # It collapses the "sh" digraph so that "shixmemmedov" and
                # "sixmemmedov" normalize to the same token.
                "custom_character_filter": {
                    "type": "pattern_replace",
                    "pattern": "sh",
                    "replacement": "s"
                }
            },
            "analyzer": {
                # Used at both index and search time on title and content.
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding_filter",
                        "custom_character_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "custom_analyzer"
            },
            "content": {
                "type": "text",
                "analyzer": "custom_analyzer"
            }
        }
    }
}

What I've Tried

I created this index using the Elasticsearch Python client.
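
For reference, the create call is just the following (a minimal sketch; es is an AsyncElasticsearch instance and index_name is a placeholder for my real index name):

# Create the index with the settings and mappings defined above.
await es.indices.create(index=index_name, body=index_body)

For searching, I used the following code: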

query_body = {
    "size": 10,
    "query": {
        "multi_match": {
            "query": "sixmemmedov",
            "fields": ["title", "content"],
            # "AUTO" fuzziness optionally takes two values ("AUTO:low,high");
            # "AUTO:5" alone is rejected by Elasticsearch, so I switched to
            # the explicit two-value form.
            "fuzziness": "AUTO:3,6",
            "max_expansions": 100
        }
    }
}

response = await es.search(index=index_name, body=query_body)
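
To see what the analyzer actually emits, I assume something like the _analyze call below would show the tokens (a sketch, not verified output):

# Inspect the tokens custom_analyzer produces for an Azerbaijani surname.
result = await es.indices.analyze(
    index=index_name,
    body={"analyzer": "custom_analyzer", "text": "şıxməmmədov"},
)
print([t["token"] for t in result["tokens"]])
# I expect something like ['sixmemmedov'], but I am not certain that
# asciifolding maps "ə" to "e" rather than some other ASCII letter,
# which is part of what I am asking below.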

What I Need Help With

  1. Index Creation: Is my current index setup the best way to handle Azerbaijani characters and transliterations (e.g., "ş" to "s", "ə" to "e")? How can I improve this configuration to cover more complex transliterations and character mappings?
  2. Search: How can I make my search more flexible to match Azerbaijani characters with their transliterated forms? Should I use different filters or analyzers for the query to achieve more consistent results?
  3. Pattern Replacement: Is using a pattern_replace token filter the correct approach for handling character mapping, or is there a more efficient way to set up this transliteration? (One alternative I have been considering is sketched below.)
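
For question 3, here is the alternative I have been sketching: an explicit mapping char_filter that rewrites each Azerbaijani letter to its usual transliteration before tokenization, instead of relying on asciifolding's choices. This is untested, and the filter name is a placeholder:

alternative_index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "azerbaijani_translit": {
                    "type": "mapping",
                    "mappings": [
                        # char_filters run before the `lowercase` token
                        # filter, so uppercase letters need their own entries.
                        "ə => e", "Ə => E",
                        "ş => s", "Ş => S",
                        "ç => c", "Ç => C",
                        "ğ => g", "Ğ => G",
                        "ı => i", "İ => I",
                        "ö => o", "Ö => O",
                        "ü => u", "Ü => U"
                    ]
                }
            },
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["azerbaijani_translit"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "custom_analyzer"},
            "content": {"type": "text", "analyzer": "custom_analyzer"}
        }
    }
}

Would this be a better starting point than the pattern_replace token filter, and could the "sh" => "s" digraph rule live in the same char_filter?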

Any advice, best practices, or sample configurations on creating an index and performing searches to handle these character mappings would be greatly appreciated. Thank you in advance for your assistance!