Severe Performance Degradation After Adding High-Cardinality Text Field to Large Index

I'm experiencing a significant performance issue after adding a new multi-valued text field to a large Elasticsearch index and would appreciate insights from anyone who has faced similar challenges.

Setup:

  • Elasticsearch 8.13.0, 4-node cluster
  • Index: 400M documents, 18 shards (~100GB to 150GB per shard)
  • Routing by a specific field, let's say routing_id (there are thousands of routing_id values)
  • New field: names (text with edge n-gram analyzer, 2-20 chars)
  • Multi-valued: 3-4 entries per document on average

Problem:
Initially tested the field on 150K documents with excellent performance (milliseconds). After backfilling all 400M documents, query performance degraded catastrophically from milliseconds to 10+ seconds.

Mapping:

{
  "mappings": {
    "properties": {
      "routing_id": {"type": "keyword"},
      "unique_id": {"type": "keyword"},
      "names": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard",
        "fields": {
          "keyword": {"type": "keyword"}
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter","digit"]
        }
      }
    }
  }
}
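
To illustrate the token explosion, here is what the analyzer emits for a single example name (via the _analyze API; the name is just an example):

GET index/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": "Dimitris Makris"
}

This returns one token per prefix of each word ("di", "dim", ..., "dimitris", "ma", "mak", ..., "makris"), i.e. roughly one token per character of every indexed name.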

Query Pattern:

GET index/_search?routing=123
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {“routing_id”: 123}},
        {"match": {“names”: {"query": "searchterm", "fuzziness": "1"}}}
      ]
    }
  },
  "aggs": {
    "distribution": {
      "terms": {"field": "names.keyword", "size": 100},
      "aggs": {
        "unique_count": {"cardinality": {"field": “unique_id"}}
      }
    }
  }
}

Goal:
I'm implementing an autocomplete feature for names with the following requirements:

  1. Interactive Autocomplete: As users type names (e.g., "Joh"), show matching names (e.g., "Johannes", "John Borris", "Some other John")
  2. Document Counts: For each suggested name, display how many documents exist (need to count only the unique ones based on a field called unique_id)
  3. Substring Matching: Users should be able to match any part of a name, not just prefixes (e.g., "tris" should match "Dimitris"), including the second word of a name (e.g., "Makr" should match "Dimitris Makris")
  4. Routing Isolation: Results must be filtered by routing_id
  5. Performance Requirements: Sub-second response time for interactive user experience

Analysis:
Despite routing and filtering down to roughly 5,000 documents, queries now take 10+ seconds. I believe this is due to term dictionary size explosion: the field now contains terms from 400M × 3-4 ≈ 1.2B name entries, and ES must traverse the entire term dictionary before applying filters. The field's high cardinality presumably makes the situation worse.
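
To verify where the time actually goes, one thing I can do is run the query with the Profile API, which breaks down the time spent per query component and per shard (a sketch of the same query shape, with searchterm as a placeholder):

GET index/_search?routing=123
{
  "profile": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"routing_id": 123}},
        {"match": {"names": {"query": "searchterm", "fuzziness": "1"}}}
      ]
    }
  }
}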

Attempted Optimizations:

  • Replacing match+fuzziness with prefix queries (see the sketch after this list)
  • Adjusting refresh intervals
  • execution_hint: "map" for aggregations
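
For reference, the prefix variant was roughly of this shape (a sketch against the keyword sub-field; the exact query may have differed):

GET index/_search?routing=123
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"routing_id": 123}},
        {"prefix": {"names.keyword": {"value": "searchterm"}}}
      ]
    }
  }
}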

Planned Optimizations:

  • Reindexing in order to reduce the size of each shard (but I'm not sure if the performance will become acceptable)

Questions:
What's the best architectural approach for this type of "autocomplete with aggs" functionality at scale? Should I:

  • Optimize the current approach?
  • Move to a separate autocomplete-specific index?
  • Use a completion suggester with some additions maybe? (see the sketch after this list)
  • Consider a different approach entirely?
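
For reference, the completion suggester option would look roughly like this (a sketch with a hypothetical names_suggest field; note that out of the box it only matches prefixes and does not support aggregations):

PUT autocomplete_index
{
  "mappings": {
    "properties": {
      "names_suggest": {"type": "completion"}
    }
  }
}

GET autocomplete_index/_search
{
  "suggest": {
    "name_suggestions": {
      "prefix": "joh",
      "completion": {"field": "names_suggest", "size": 10}
    }
  }
}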

Any insights on handling autocomplete functionality at this scale with the above-mentioned functionality would be greatly appreciated!

Welcome!

Yes. That's the first thing I'd do. Have a look at Size your shards | Elastic Docs

  • Very large shards can slow down search operations and prolong recovery times after failures.
  • Aim for shard sizes between 10GB and 50GB

What would be interesting in your case would be to find "when" it starts to degrade in terms of performance: after how many docs, at what exact size...
As you are doing full-text search here (it's not a time-series dataset), I'd keep the shard size low, like 10GB.

As you are using specific routing keys, do you have some shards much bigger than others? What does the GET _cat/shards?h command return for your index?
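
For example, something like this lists the shards for the index sorted by size, which makes outliers easy to spot:

GET _cat/shards/index?v&h=index,shard,prirep,docs,store&s=store:desc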


What is the hardware specification of this cluster? Are you using local SSDs?

I would also recommend upgrading to the latest 8.18 version, as there have been performance improvements that I am not sure are in your version.

This is quite a large shard size and can affect performance. I would recommend reindexing the data into a new index with 80-100 primary shards.

Using routing is the right way to go as long as each document only has a single routing value. This will force all documents that contain a specific routing value to be located in the same shard and when you specify routing at query time only a single shard will need to be searched, which together with smaller shards should help improve latency and increase supported query throughput.
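
As an illustration, the routing value is supplied both at index time and at search time, so all documents with the same value end up on, and are queried from, a single shard (field values here are just examples):

PUT index/_doc/1?routing=123
{
  "routing_id": "123",
  "unique_id": "abc-1",
  "names": ["Dimitris Makris"]
}

GET index/_search?routing=123
{
  "query": {"term": {"routing_id": "123"}}
}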

Hi David, thank you very much for your reply.

Actually the suggestion on "when" it starts to degrade is super interesting. I will definitely look into this.

Regarding the shard size: the shard sizes range from 100 to 150 GB.

Hi Christian, thank you very much for your reply.

I have a similar view regarding the number of shards, but I'm unsure how to distribute them across nodes. Is there a hard limit on the number of shards per node?

I do actually use routing (I updated the query to reflect it), and in that case I also use it for filtering. Each shard may contain ~50k routing keys and approximately 22 million documents.

I will also look into the newer version of ES. 🙏

The limit per node is 1000, so you should be fine. Make sure you set the number of replica shards to 1 to reduce overall shard count.

Routing can result in uneven shard sizes depending on the distribution of routing keys, but efficiency generally increases with the primary shard count. As David suggested, try to keep the shard size small as it is a search use case. 10GB average size would mean around 180 primary shards. That is 360 total shards, or 90 per node, which is no problem.
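
A minimal sketch of that reindex (index names and shard count are illustrative, and the new index would also need the existing mappings and analysis settings):

PUT index_v2
{
  "settings": {
    "number_of_shards": 180,
    "number_of_replicas": 1
  }
}

POST _reindex
{
  "source": {"index": "index"},
  "dest": {"index": "index_v2"}
}

By default _reindex preserves each document's routing value, so the routing scheme carries over to the new index.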


I considered routing on a specific routing_key as the best possible approach, since all the functionality is targeted at one routing_key and there is no possibility that I might need to mix documents from multiple routing keys. So I accepted the drawback of uneven shard sizes.
But if there is a better approach that I'm not aware of, I would definitely consider it.

No, that is the approach I would recommend. I suspect you will see improved performance if you upgrade Elasticsearch and reduce the shard size.

As the data set is large it is possible that you may also suffer from resource constraints in the cluster, e.g. disk I/O, but as you have not provided any details about the cluster that is hard to tell. It is worth monitoring though.

1 Like

Here are my hardware specifications:

  • Total Memory: 201 GiB (1 + 50×4 = 201 GiB)
  • Total CPU: 32.5 cores (0.5 + 8×4 = 32.5 cores)
  • Total Storage: 10,001 GiB (1 + 2,500×4 = 10,001 GiB)
  • Total Nodes: 5 (1 master + 4 data nodes)

Running a cluster with a single master-eligible node is very bad, as it does not offer any resiliency. If you have 5 nodes in total, I would have 1 dedicated master node as you do now and make the other 4 master/data nodes.
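
Node roles are set per node in elasticsearch.yml; as a sketch (assuming defaults otherwise):

# dedicated master node
node.roles: [ master ]

# combined master/data node
node.roles: [ master, data ]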

What type of storage is this? Local SSD?

Yes! More on this at: Node roles | Elastic Docs

The master node is HDD (standard-rwo) and the other nodes are local SSD (premium-rwo).