Thank you for the earlier response! Your solution using multi-fields with separate analyzers works for preserving special characters like chatgpt.com and AT&T.
Follow-Up Concern:
In my use case, each of our language-specific indices will store millions of documents. If I define two analyzers for the description field (e.g., simple for general analysis and exact for special characters):
Will this double the storage/RAM usage for the description field?
Are there optimizations in Elasticsearch to mitigate this overhead (e.g., compression, shared resources)?
If the overhead is significant, are there alternative approaches to achieve exact matches without duplicating analysis (e.g., using keyword types with normalizers, runtime fields, or custom token filters)?
Goal:
Retain exact matching for terms with special characters (., &, etc.).
Minimize resource consumption for large-scale deployments.
I think you're just adding a keyword sub-field for exact matching, so it will certainly add to storage, but not to the same extent as a language-analyzed field.
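For illustration, a multi-field mapping along these lines keeps the language-analyzed text field and adds an untokenized keyword sub-field for exact matches (the index name products_en and the sub-field name exact are just placeholders):

```json
PUT products_en
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "exact": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
```

With ignore_above set, very long description values are skipped in the keyword sub-field, which helps keep its overhead down.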
Your setup is quite custom, so it is hard to give a precise estimate. I would recommend running some tests or benchmarks on a representative sample, but I don't expect the impact to be large.
If the overhead does turn out to be significant, you can always scale the cluster. Another option is to use a match_phrase query against the field you already have, which doesn't require adding any custom analyzer.
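As a rough sketch of the match_phrase option (again with hypothetical index and field names):

```json
GET products_en/_search
{
  "query": {
    "match_phrase": {
      "description": "AT&T"
    }
  }
}
```

One caveat: match_phrase still runs the query text through the field's analyzer, so if the analyzer splits AT&T into at and t, the phrase query matches those tokens in adjacent positions rather than the literal string AT&T.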
We currently maintain a separate index for each language supported by Elasticsearch, using the built-in language-specific analyzers. These indices have multiple shards and replicas for high availability, and auto-scaling is in place. Because we process and store massive amounts of data (40 to 50 million documents or more, split across the language indices), cost plays a crucial role in keeping production efficient.
Each document contains four fields that require analysis. If we define two analyzers for all these fields, it will significantly increase storage requirements and CPU utilization. However, we rely on Elasticsearch's specialized language analyzers, which handle stemming and stopword removal efficiently, reducing our workload for data filtering and search optimization.
To further refine our approach, I plan to define a char_filter that replaces specific special characters during analysis, preserving word values for terms like #Elastic or AT&T. The analyzer configuration will look something like this:
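(This is only a rough sketch; the index, analyzer, and char_filter names, as well as the replacement tokens, are placeholders I would still adapt to our data.)

```json
PUT products_en
{
  "settings": {
    "analysis": {
      "char_filter": {
        "preserve_specials": {
          "type": "mapping",
          "mappings": [
            "& => __amp__",
            "# => __hash__"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "english_preserving": {
          "type": "custom",
          "char_filter": ["preserve_specials"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop", "english_stemmer"]
        }
      }
    }
  }
}
```

Because the same analyzer is applied at search time, a query for AT&T or #Elastic produces the same single token as at index time and matches exactly.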
This approach allows us to optimize search accuracy while keeping resource usage under control. Let me know if you have any suggestions or alternative solutions.