I'm experiencing a significant performance issue after adding a new multi-valued text field to a large Elasticsearch index and would appreciate insights from anyone who has faced similar challenges.
Setup:
- Elasticsearch 8.13.0, 4-node cluster
- Index: 400M documents, 18 shards (~100GB to 150GB per shard)
- Routing by a specific field, say routing_id (there are thousands of distinct routing_id values)
- New field: names (text with edge n-gram analyzer, 2-20 chars)
- Multi-valued: 3-4 entries per document on average
Problem:
I initially tested the field on 150K documents with excellent performance (milliseconds). After backfilling all 400M documents, query latency degraded catastrophically, from milliseconds to 10+ seconds.
Mapping:
{
  "mappings": {
    "properties": {
      "routing_id": {"type": "keyword"},
      "unique_id": {"type": "keyword"},
      "names": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard",
        "fields": {
          "keyword": {"type": "keyword"}
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter","digit"]
        }
      }
    }
  }
}
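For reference, the analyzer output can be sanity-checked with the _analyze API ("Dimitris Makris" is just a sample value):

POST index/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": "Dimitris Makris"
}

This emits per-word prefixes (di, dim, dimi, ..., ma, mak, makr, ...), roughly 10-20 tokens per name, which is what multiplies the term count once all 400M documents are backfilled.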
Query Pattern:
GET index/_search?routing=123
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {“routing_id”: 123}},
        {"match": {“names”: {"query": "searchterm", "fuzziness": "1"}}}
      ]
    }
  },
  "aggs": {
    "distribution": {
      "terms": {"field": "names.keyword", "size": 100},
      "aggs": {
        "unique_count": {"cardinality": {"field": “unique_id"}}
      }
    }
  }
}
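In case it helps diagnose where the time goes, the same query can be run with "profile": true (aggregations omitted here for brevity; they get profiled too when included):

GET index/_search?routing=123
{
  "profile": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"routing_id": 123}},
        {"match": {"names": {"query": "searchterm", "fuzziness": "1"}}}
      ]
    }
  }
}

The per-shard breakdown should show whether the time goes into query rewriting (fuzzy expansion over the term dictionary) or into the aggregation phase.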
Goal:
I'm implementing an autocomplete feature for names with the following requirements:
- Interactive Autocomplete: As users type names (e.g., "Joh"), show matching names (e.g., "Johannes", "John Borris", "Some other John")
- Document Counts: For each suggested name, display how many documents exist (counting only unique documents, based on a field called unique_id)
- Substring Matching: Users should be able to search for any part of a name, not just prefixes (e.g., "tris" should match "Dimitris"), including later words of a name (e.g., "Makr" should match "Dimitris Makris"); see the analyzer note after this list
- Routing Isolation: Results must be filtered by routing_id
- Performance Requirements: Sub-second response time for interactive user experience
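One note on the substring requirement: the edge n-gram tokenizer above only indexes prefixes of each word, so the "tris" → "Dimitris" case would additionally need full (infix) n-grams. A minimal sketch of such an analyzer, assuming a narrower gram range to limit index growth (the infix_ngram_* names are mine):

PUT index_infix_test
{
  "settings": {
    "index.max_ngram_diff": 3,
    "analysis": {
      "analyzer": {
        "infix_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "infix_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "infix_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}

With this, "tris" becomes an indexed token of "Dimitris", though the token count per name grows even faster than with edge n-grams.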
Analysis:
Despite routing and filtering down to ~5,000 documents, queries take around 10 seconds. I believe this is due to term dictionary size explosion: the field now contains terms generated from 400M × 3-4 ≈ 1.2B name entries, each expanded into many edge n-grams, and ES must traverse this much larger term dictionary before applying the filters. The field's high cardinality presumably doesn't help either.
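To put numbers behind the term-dictionary theory, the disk usage API can break down storage per field (it analyzes the index, so best run off-peak):

POST index/_disk_usage?run_expensive_tasks=true

The inverted index size reported for names should confirm whether the terms dictionary is the dominant cost.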
Attempted Optimizations:
- Replacing match+fuzziness with prefix queries (a combined variant is sketched after this list)
- Adjusting refresh intervals
- execution_hint: "map" for aggregations
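For reference, here's roughly the variant combining the first and third attempts: dropping fuzziness, so the standard-analyzed input resolves to a direct term lookup against the indexed edge n-grams, plus execution_hint on the terms aggregation ("searchterm" stands in for the user's input):

GET index/_search?routing=123
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"routing_id": 123}},
        {"match": {"names": "searchterm"}}
      ]
    }
  },
  "aggs": {
    "distribution": {
      "terms": {
        "field": "names.keyword",
        "size": 100,
        "execution_hint": "map"
      },
      "aggs": {
        "unique_count": {"cardinality": {"field": "unique_id"}}
      }
    }
  }
}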
Planned Optimizations:
- Reindexing in order to reduce the size of each shard (sketched below), though I'm not sure the performance will become acceptable
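Concretely, that would mean creating a new index with more shards and copying the data across, e.g. (index_v2 and the shard count are placeholders; mappings and analysis settings as above, omitted):

PUT index_v2
{
  "settings": {"number_of_shards": 36}
}

POST _reindex?slices=auto&wait_for_completion=false
{
  "source": {"index": "index"},
  "dest": {"index": "index_v2"}
}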
Questions:
What's the best architectural approach for this type of "autocomplete with aggs" functionality at scale? Should I:
- Optimize the current approach?
- Move to a separate autocomplete-specific index (sketched after this list)?
- Use the completion suggester with some additions, maybe?
- Consider a different approach entirely?
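To make the separate-index option concrete, here's roughly what I have in mind: a small side index with one document per (routing_id, name) pair and a precomputed unique count maintained by a periodic job, so the interactive query becomes a plain filtered search with no aggregations (names_autocomplete and unique_doc_count are hypothetical; analysis settings as in the main index, omitted):

PUT names_autocomplete
{
  "mappings": {
    "properties": {
      "routing_id": {"type": "keyword"},
      "name": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard",
        "fields": {"keyword": {"type": "keyword"}}
      },
      "unique_doc_count": {"type": "long"}
    }
  }
}

GET names_autocomplete/_search?routing=123
{
  "size": 100,
  "_source": ["name", "unique_doc_count"],
  "query": {
    "bool": {
      "filter": [
        {"term": {"routing_id": 123}},
        {"match": {"name": "joh"}}
      ]
    }
  }
}

The open question with this approach is whether keeping unique_doc_count fresh enough (e.g., via a scheduled composite terms aggregation over the main index) would be acceptable.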
Any insights on handling autocomplete at this scale with the above requirements would be greatly appreciated!
