Tokenization of Special Characters Disrupts Exact Phrase Matching in Multi-Language Indexes

Context:

  • Setup:
    • Multiple indices (one per language + a general index for unknown languages).
    • Each document has 10+ fields, including a description field.
    • Language-specific indices use corresponding analyzers (e.g., English analyzer for English).

Issue:

The description field splits terms at special characters (e.g., ., &), causing exact phrase searches to fail. Examples:

  1. chatgpt.com → Tokenized as [chatgpt, com] (split at .).
  2. AT&T → Tokenized as [AT, T] (split at &).

Goal:

  • Exact phrase matches for terms with special characters (e.g., chatgpt.com or AT&T).
  • Preserve language-specific analysis for other use cases (e.g., stemming, stop words).

Attempted Solution:

Used built-in language analyzers (e.g., english), but they split tokens at special characters.

Question:

How can I customize the analyzer or mapping to:

  1. Prevent splitting tokens at special characters (e.g., ., &).
  2. Still retain language-specific analysis for non-special-character terms?

Suggestions for Clarity (Already Included Above):

  1. Use sub-fields (multi-fields) for description to support both language-specific and exact matching.
  2. Define a custom analyzer with a tokenizer/pattern that retains special characters.
  3. Use match_phrase or keyword types for exact queries.

Hi @Himanshu_Gautam1, Welcome to the Elastic community -

I was not able to reproduce this with the english analyzer, because it produces the exact token chatgpt.com, but I could reproduce it with the simple analyzer.

Below is a quick example that supports both exact and full-text search.

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_character_preserving_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "simple",  // Language-specific analyzer
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "special_character_preserving_analyzer"
          }
        }
      }
    }
  }
}

POST test/_doc
{
  "description": "chatgpt.com"
}

POST test/_doc
{
  "description": "AT&T"
}

GET test/_search
{
  "query": {
    "multi_match" : {
      "query": "chatgpt.com",
      "fields": ["description", "description.exact"]
    }
  }
}
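
You can check the tokens each sub-field produces with the _analyze API (the index and field names are from the example above). The exact sub-field should keep AT&T as a single lowercased token, while the simple-analyzed field splits it at the &:

GET test/_analyze
{
  "field": "description.exact",
  "text": "AT&T"
}

GET test/_analyze
{
  "field": "description",
  "text": "AT&T"
}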

Hi ashishtiwari1993,

Thank you for the earlier response! Your solution using multi-fields with separate analyzers works for preserving special characters like chatgpt.com and AT&T.

Follow-Up Concern:
In my use case, each index will store millions of documents (multi-language indices). If I define two analyzers for the description field (e.g., simple for general analysis and exact for special characters):

  1. Will this double the storage/RAM usage for the description field?
  2. Are there optimizations in Elasticsearch to mitigate this overhead (e.g., compression, shared resources)?
  3. If the overhead is significant, are there alternative approaches to achieve exact matches without duplicating analysis (e.g., using keyword types with normalizers, runtime fields, or custom token filters)? See the rough sketch below for what I mean by the keyword-with-normalizer option.
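
For reference, the keyword-with-normalizer idea from point 3 would look roughly like the sketch below (the index, field, and normalizer names are only illustrative). As I understand it, a keyword sub-field only matches the entire field value, so it would fit short values like chatgpt.com but not a phrase inside a longer description:

PUT test_keyword_normalizer
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]  // applied at index time and at search time
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "english",  // language-specific analysis stays on the main field
        "fields": {
          "raw": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}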

Goal:

  • Retain exact matching for terms with special characters (., &, etc.).
  • Minimize resource consumption for large-scale deployments.

Your insights would be invaluable!

Hi Himanshu,

  1. I think we're just adding an extra keyword-tokenized sub-field for exact matching, so it will contribute to storage for sure, but not to the same extent as a language-analyzed field.
  2. This approach is quite custom, so it is hard to give an estimate. I would recommend running some tests or a benchmark, but I don't expect it to have a large impact.
  3. If the overhead turns out to be significant, you can always scale out. Alternatively, you can use a match_phrase query, which doesn't need any custom analyzer:
PUT test1
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "simple"
      }
    }
  }
}

POST test1/_doc
{
  "description": "chatgpt.com"
}

POST test1/_doc
{
  "description": "AT&T"
}

POST test1/_doc
{
  "description": "T&L"
}

GET test1/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "description": "T"
          }
        },
        {
          "match_phrase": {
            "description": "AT&T"
          }
        }
      ]
    }
  }
}
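
The reason match_phrase works here is that the simple analyzer splits AT&T into the tokens at and t, and match_phrase requires those tokens to appear adjacent and in order. You can confirm the tokenization with _analyze:

GET test1/_analyze
{
  "analyzer": "simple",
  "text": "AT&T"
}

Keep in mind this also means a document whose text contains at and t as adjacent words would match the same phrase.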

We currently maintain separate indices for each language supported by Elasticsearch, using the built-in language-specific analyzers. These indices have multiple shards and replicas to ensure high availability, and auto-scaling is also in place. Given that we process and store massive amounts of data (upwards of 40-50 million documents, split by language and index), cost plays a crucial role in maintaining production efficiency.

Each document contains four fields that require analysis. If we define two analyzers for all these fields, it will significantly increase storage requirements and CPU utilization. However, we rely on Elasticsearch's specialized language analyzers, which handle stemming and stopword removal efficiently, reducing our workload for data filtering and search optimization.

To further refine our approach, I plan to define a char_filter that replaces specific special characters during analysis, preserving word values for terms like #Elastic or AT&T. The analyzer configuration will look something like this:

{
    "settings": {
        "analysis": {
            "char_filter": {
                "replace_special_char": {
                    "type": "mapping",
                    "mappings": SPECIAL_CHAR_MAPPING
                }
            },
            "filter": {
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_"
                },
                "english_keywords": {
                    "type": "keyword_marker",
                    "keywords": [
                        "example"
                    ]
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                }
            },
            "analyzer": {
                "rebuilt_english": {
                    "tokenizer": "standard",
                    "char_filter": ["replace_special_char"],
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_keywords",
                        "english_stemmer"
                    ]
                }
            }
        }
    }
}
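
SPECIAL_CHAR_MAPPING above is still a placeholder. For illustration only (the actual character set is undecided), the mapping char filter takes entries of the form "key => value"; replacing a special character with a word-joining string such as _and_ keeps the standard tokenizer from splitting on it, so AT&T would come through as a single token like at_and_t. A candidate mapping can be tested directly with _analyze before wiring it into rebuilt_english:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "& => _and_",
        "# => _hash_"
      ]
    }
  ],
  "text": "AT&T #Elastic"
}

Because the same analyzer runs at query time, a query for AT&T against the field is rewritten identically, so regular match queries keep working.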

This approach allows us to optimize search accuracy while keeping resource usage under control. Let me know if you have any suggestions or alternative solutions.