I'm experiencing a significant performance issue after adding a new multi-valued text field to a large Elasticsearch index and would appreciate insights from anyone who has faced similar challenges.
Setup:
- Elasticsearch 8.13.0, 4-node cluster
- Index: 400M documents, 18 shards (~100GB to 150GB per shard)
- Routing by a specific field, let's say routing_id (there are thousands of distinct routing_id values)
- New field: names (text with edge n-gram analyzer, 2-20 chars)
- Multi-valued: 3-4 entries per document on average
Problem:
Initially tested the field on 150K documents with excellent performance (milliseconds). After backfilling all 400M documents, query performance degraded catastrophically from milliseconds to 10+ seconds.
Mapping:
{
  "mappings": {
    "properties": {
      "routing_id": {"type": "keyword"},
      "unique_id": {"type": "keyword"},
      "names": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard",
        "fields": {
          "keyword": {"type": "keyword"}
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}
Query Pattern:
GET index/_search?routing=123
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"routing_id": 123}},
        {"match": {"names": {"query": "searchterm", "fuzziness": "1"}}}
      ]
    }
  },
  "aggs": {
    "distribution": {
      "terms": {"field": "names.keyword", "size": 100},
      "aggs": {
        "unique_count": {"cardinality": {"field": "unique_id"}}
      }
    }
  }
}
Goal:
I'm implementing an autocomplete feature for names with the following requirements:
- Interactive Autocomplete: As users type names (e.g., "Joh"), show matching names (e.g., "Johannes", "John Borris", "Some other John")
- Document Counts: For each suggested name, display how many documents exist (need to count only the unique ones based on a field called unique_id)
- Substring Matching: Users should be able to match any part of a name, not just the prefix: "tris" should match "Dimitris", and the second word of a name should also match (e.g., "Makr" should match "Dimitris Makris")
- Routing Isolation: Results must be filtered by routing_id
- Performance Requirements: Sub-second response time for interactive user experience
Analysis:
Despite routing and filtering down to roughly 5,000 documents, queries still take around 10 seconds. I believe this is due to term dictionary size explosion: the field now holds terms from 400M × 3-4 ≈ 1.2B name entries, each expanded into many edge n-grams, and ES must traverse these large per-segment term dictionaries before the filters can narrow anything down. The field is also high-cardinality by nature, which I assume doesn't help in this situation.
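For what it's worth, I think this hypothesis could be checked with the analyze disk usage API, which reports per-field inverted index sizes (the index name below is a placeholder):

```
POST /index/_disk_usage?run_expensive_tasks=true
```

If the inverted_index figure for names dominates the shard size, that would point at the n-gram expansion.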
Attempted Optimizations:
- Replacing match+fuzziness with prefix queries
- Adjusting refresh intervals
- execution_hint: "map" for aggregations
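Concretely, the reworked query combining the first and third items looked roughly like this (dropping fuzziness, since the indexed edge n-grams already match typed prefixes, and hinting the terms agg):

```
GET index/_search?routing=123
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"routing_id": 123}},
        {"match": {"names": {"query": "searchterm"}}}
      ]
    }
  },
  "aggs": {
    "distribution": {
      "terms": {"field": "names.keyword", "size": 100, "execution_hint": "map"},
      "aggs": {
        "unique_count": {"cardinality": {"field": "unique_id"}}
      }
    }
  }
}
```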
Planned Optimizations:
- Reindexing to reduce the size of each shard (though I'm not sure the performance will become acceptable)
Questions:
What's the best architectural approach for this type of "autocomplete with aggs" functionality at scale? Should I:
- Optimize the current approach?
- Move to a separate autocomplete-specific index?
- Use the completion suggester, possibly with some additions?
- Consider a different approach entirely?
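For the completion suggester option, my rough sketch would be a separate field with routing_id as a category context (index and field names below are placeholders). As far as I understand, completion is prefix-only, so the substring requirement would mean indexing every rotation of each name, and the per-name unique counts would have to be precomputed rather than aggregated at query time:

```
PUT names_suggest
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion",
        "contexts": [
          {"name": "routing_id", "type": "category"}
        ]
      }
    }
  }
}

POST names_suggest/_search
{
  "suggest": {
    "name_suggest": {
      "prefix": "Joh",
      "completion": {
        "field": "suggest",
        "contexts": {"routing_id": ["123"]}
      }
    }
  }
}
```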
Any insights on handling autocomplete functionality at this scale with the above-mentioned functionality would be greatly appreciated!