Severe Performance Degradation After Adding High-Cardinality Text Field to Large Index

I'm experiencing a significant performance issue after adding a new multi-valued text field to a large Elasticsearch index and would appreciate insights from anyone who has faced similar challenges.

Setup:

  • Elasticsearch 8.13.0, 4-node cluster
  • Index: 400M documents, 18 shards (~100GB to 150GB per shard)
  • Routing by a specific field, let's say routing_id (there are thousands of routing_id values)
  • New field: names (text with edge n-gram analyzer, 2-20 chars)
  • Multi-valued: 3-4 entries per document on average

Problem:
Initially tested the field on 150K documents with excellent performance (milliseconds). After backfilling all 400M documents, query performance degraded catastrophically from milliseconds to 10+ seconds.

Mapping:

{
  "mappings": {
    "properties": {
      "routing_id": {"type": "keyword"},
      "unique_id": {"type": "keyword"},
      "names": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard",
        "fields": {
          "keyword": {"type": "keyword"}
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter","digit"]
        }
      }
    }
  }
}
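
To illustrate the token explosion, here is what the analyzer emits for a single example name (via the _analyze API; the name is just an example):

GET index/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": "Dimitris Makris"
}

This returns one token per prefix of each word ("di", "dim", ..., "dimitris", "ma", "mak", ..., "makris"), i.e. roughly one token per character of every indexed name.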

Query Pattern:

GET index/_search?routing=123
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {“routing_id”: 123}},
        {"match": {“names”: {"query": "searchterm", "fuzziness": "1"}}}
      ]
    }
  },
  "aggs": {
    "distribution": {
      "terms": {"field": "names.keyword", "size": 100},
      "aggs": {
        "unique_count": {"cardinality": {"field": “unique_id"}}
      }
    }
  }
}

Goal:
I'm implementing an autocomplete feature for names with the following requirements:

  1. Interactive Autocomplete: As users type names (e.g., "Joh"), show matching names (e.g., "Johannes", "John Borris", "Some other John")
  2. Document Counts: For each suggested name, display how many documents exist (need to count only the unique ones based on a field called unique_id)
  3. Substring Matching: Users should be able to match any part of a name, not just prefixes (e.g., "tris" should match "Dimitris"), including the second word of a name (e.g., "Makr" should match "Dimitris Makris")
  4. Routing Isolation: Results must be filtered by routing_id
  5. Performance Requirements: Sub-second response time for interactive user experience

Analysis:
Despite routing and filtering down to roughly 5,000 documents, queries now take 10+ seconds. I believe this is due to term dictionary size explosion: the field now contains terms from 400M × 3-4 ≈ 1.2B name entries, and ES must traverse the entire term dictionary before applying filters. The field's high cardinality presumably makes the situation worse.
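
To verify where the time actually goes, one thing I can do is run the query with the Profile API, which breaks down the time spent per query component and per shard (a sketch of the same query shape, with searchterm as a placeholder):

GET index/_search?routing=123
{
  "profile": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"routing_id": 123}},
        {"match": {"names": {"query": "searchterm", "fuzziness": "1"}}}
      ]
    }
  }
}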

Attempted Optimizations:

  • Replacing match+fuzziness with prefix queries (see the sketch after this list)
  • Adjusting refresh intervals
  • execution_hint: "map" for aggregations
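
For reference, the prefix variant was roughly of this shape (a sketch against the keyword sub-field; the exact query may have differed):

GET index/_search?routing=123
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {"term": {"routing_id": 123}},
        {"prefix": {"names.keyword": {"value": "searchterm"}}}
      ]
    }
  }
}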

Planned Optimizations:

  • Reindexing in order to reduce the size of each shard (but I'm not sure if the performance will become acceptable)

Questions:
What's the best architectural approach for this type of "autocomplete with aggs" functionality at scale? Should I:

  • Optimize the current approach?
  • Move to a separate autocomplete-specific index?
  • Use a completion suggester with some additions maybe? (see the sketch after this list)
  • Consider a different approach entirely?
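
For reference, the completion suggester option would look roughly like this (a sketch with a hypothetical names_suggest field; note that out of the box it only matches prefixes and does not support aggregations):

PUT autocomplete_index
{
  "mappings": {
    "properties": {
      "names_suggest": {"type": "completion"}
    }
  }
}

GET autocomplete_index/_search
{
  "suggest": {
    "name_suggestions": {
      "prefix": "joh",
      "completion": {"field": "names_suggest", "size": 10}
    }
  }
}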

Any insights on handling autocomplete functionality at this scale with the above-mentioned functionality would be greatly appreciated!

Welcome!

Yes. That's the first thing I'd do. Have a look at Size your shards | Elastic Docs

  • Very large shards can slow down search operations and prolong recovery times after failures.
  • Aim for shard sizes between 10GB and 50GB

What would be interesting in your case would be to find "when" it starts to degrade in terms of performance: after how many docs, at what exact size...
As you are doing full-text search here (it's not a time-series dataset), I'd keep the shard size low, like 10GB.

As you are using specific routing keys, do you have some shards much bigger than others? What does the GET _cat/shards?h command return for your index?
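
For example, something like this lists the shards for the index sorted by size, which makes outliers easy to spot:

GET _cat/shards/index?v&h=index,shard,prirep,docs,store&s=store:desc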


What is the hardware specification of this cluster? Are you using local SSDs?

I would also recommend upgrading to the latest 8.18 version, as there have been performance improvements that I am not sure are in your version.

This is quite a large shard size and can affect performance. I would recommend reindexing the data into a new index with 80-100 primary shards.

Using routing is the right way to go as long as each document only has a single routing value. This will force all documents that contain a specific routing value to be located in the same shard and when you specify routing at query time only a single shard will need to be searched, which together with smaller shards should help improve latency and increase supported query throughput.
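
As an illustration, the routing value is supplied both at index time and at search time, so all documents with the same value end up on, and are queried from, a single shard (field values here are just examples):

PUT index/_doc/1?routing=123
{
  "routing_id": "123",
  "unique_id": "abc-1",
  "names": ["Dimitris Makris"]
}

GET index/_search?routing=123
{
  "query": {"term": {"routing_id": "123"}}
}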

Hi David, thank you very much for your reply.

Actually the suggestion on "when" it starts to degrade is super interesting. I will definitely look into this.

Regarding the shard size: the shard sizes range from 100 to 150 GB.

Hi Christian, thank you very much for your reply.

I have a similar view regarding the number of shards, but I'm unsure how to distribute them across nodes. Is there a hard limit on the number of shards per node?

I do actually use routing (I updated the query to reflect it), and in that case I also use it for filtering. Each shard may contain ~50k routing keys and approximately 22 million documents.

I will also look into the newer version of ES. 🙏

The limit per node is 1000, so you should be fine. Make sure you set the number of replica shards to 1 to reduce overall shard count.

Routing can result in uneven shard sizes depending on the distribution of routing keys, but efficiency generally increases with the primary shard count. As David suggested, try to keep the shard size small as it is a search use case. 10GB average size would mean around 180 primary shards. That is 360 total shards, or 90 per node, which is no problem.
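
A minimal sketch of that reindex (index names and shard count are illustrative, and the new index would also need the existing mappings and analysis settings):

PUT index_v2
{
  "settings": {
    "number_of_shards": 180,
    "number_of_replicas": 1
  }
}

POST _reindex
{
  "source": {"index": "index"},
  "dest": {"index": "index_v2"}
}

By default _reindex preserves each document's routing value, so the routing scheme carries over to the new index.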


I considered routing on a specific routing_key as the best possible approach, since all the functionality is targeted at one routing_key and there is no possibility that I might need to mix documents from multiple routing keys. So I accepted the drawback of uneven shard sizes.
But if there is a better approach that I'm not aware of, I would definitely consider it.

No, that is the approach I would recommend. I suspect you will see improved performance if you upgrade Elasticsearch and reduce the shard size.

As the data set is large it is possible that you may also suffer from resource constraints in the cluster, e.g. disk I/O, but as you have not provided any details about the cluster that is hard to tell. It is worth monitoring though.

1 Like

Here are my hardware specifications:

  • Total Memory: 201 GiB (1 + 50×4 = 201 GiB)
  • Total CPU: 32.5 cores (0.5 + 8×4 = 32.5 cores)
  • Total Storage: 10,001 GiB (1 + 2,500×4 = 10,001 GiB)
  • Total Nodes: 5 (1 master + 4 data nodes)

Running a cluster with a single master-eligible node is very bad, as it does not offer any resiliency. If you have 5 nodes in total, I would have 1 dedicated master node as you do now and make the other 4 master/data nodes.
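
Node roles are set per node in elasticsearch.yml; as a sketch (assuming defaults otherwise):

# dedicated master node
node.roles: [ master ]

# combined master/data node
node.roles: [ master, data ]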

What type of storage is this? Local SSD?

Yes! More on this at: Node roles | Elastic Docs

The master node is HDD (standard-rwo) and the other nodes are local SSD (premium-rwo).