Elasticsearch query timeouts on data stream (50M–170M docs, facets + NOT queries)

Hi all,

I’m facing Elasticsearch query timeouts for a search workload stored in a data stream (immutable data, bulk indexed historical data). We are using Point-in-Time (PIT) searches for query consistency/pagination.

Backing index examples

One large backing index:

  • PRI=10, REP=2

  • docs.count = 170,461,279

  • store.size = 743.3GB

  • pri.store.size = 247.4GB

Another backing index with fewer docs also times out:

  • PRI=10, REP=2

  • docs.count = 49,764,512 (example)

  • still seeing query timeouts

Rollover setup

ILM rollover configured:

  • max_primary_shard_size = 30GB

  • max_size = 300GB

Rollover did not happen for the 170M index because shard sizes and total primary size are still below these thresholds.
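
As a quick check of the figures above, the arithmetic can be sketched as follows. This is a minimal sketch that assumes the ten primaries are roughly equal in size (ILM evaluates max_primary_shard_size against the largest primary, which is not listed here); the function name is made up for illustration.

```python
def rollover_due(pri_store_gb, primary_shards, max_primary_shard_gb, max_size_gb):
    """Return (due, avg_shard_gb) for the two size-based rollover conditions.

    Assumption: primaries are evenly sized, so the average stands in for the
    largest primary shard. max_size is compared against total primary size.
    """
    avg_shard_gb = pri_store_gb / primary_shards
    due = avg_shard_gb >= max_primary_shard_gb or pri_store_gb >= max_size_gb
    return due, avg_shard_gb


# Figures from the 170M-document backing index above.
due, avg = rollover_due(247.4, 10, 30, 300)
print(f"avg primary shard = {avg:.2f} GB, rollover due = {due}")
# prints: avg primary shard = 24.74 GB, rollover due = False
```

So both conditions are still unmet: about 24.7GB per primary against 30GB, and 247.4GB total primary size against 300GB.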

Query pattern

Typical query returns top 25 results and often includes facets:

{
  "searchText": "NOT ABC",
  "top": 25,
  "includeFacets": true,
  "filters": { "projectName": ["X"] }
}

We execute searches using Point-in-Time (PIT) for consistent pagination.

Queries may include facets, which are computed via aggregations.

Filters are applied using post_filter, and pagination is done using from/size (Skip/Take).

Note: We are currently not using custom routing. We are relying on Elasticsearch’s default routing behavior for data streams (routing based on document _id), so searches may fan out across all primary shards/backing indices.
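
To make the pattern concrete, a search body of this shape would look roughly like the sketch below. It is a reconstruction for illustration only, not the actual request: the searched fields, the facet field (authorName.raw) and the helper name are assumptions, while projectName.raw and the 1m keep-alive come from the mapping and sample request in this thread.

```python
import json


def build_search_body(pit_id, negated_text, project_names, skip=0, take=25):
    """Build a search body in the shape described above (hypothetical reconstruction)."""
    return {
        # PIT searches carry the PIT id in the body instead of naming an index.
        "pit": {"id": pit_id, "keep_alive": "1m"},
        # Skip/Take pagination.
        "from": skip,
        "size": take,
        # "NOT ABC": a bool query with only must_not matches every document
        # except those containing the term.
        "query": {
            "bool": {
                "must_not": [
                    {
                        "multi_match": {
                            "query": negated_text,
                            "fields": ["eventTitle", "eventDescription"],  # assumed fields
                        }
                    }
                ]
            }
        },
        # post_filter narrows the hits only after aggregations have been computed,
        # so the facet below is calculated over everything the query matched.
        "post_filter": {"terms": {"projectName.raw": project_names}},
        "aggs": {"by_author": {"terms": {"field": "authorName.raw", "size": 10}}},
    }


print(json.dumps(build_search_body("<pit-id>", "ABC", ["X"]), indent=2))
```

One thing this shape makes visible: with a purely negative query and the project filter in post_filter, the aggregation runs over nearly every document in the backing indices. If the facet counts are meant to reflect the project filter, the same terms clause could instead sit in bool.filter; whether that fits depends on the faceting behaviour wanted.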

Questions

  1. What are the most common causes of search timeouts on indices with ~50M documents (10 primary shards), even when requesting only the top N results (e.g., 25)?
  2. What are recommended ways to optimize queries that include facets/aggregations (and negative terms like NOT), to reduce latency and prevent timeouts?
  3. For large historical backfills into a data stream, what rollover strategy is recommended? Should we rely more on max_docs, max_primary_shard_size, or a combination of both?
  4. Should we use custom routing, or can we rely on default routing for different projects/groups?

Which version of Elasticsearch are you using?

What is the size and specification of the cluster in terms of CPU, RAM and type of storage used?

What do the mappings for the index look like?

What does a sample query with aggregations look like?

What query latencies are you experiencing? Does this occur when you issue just a single query against the cluster, or does it require a number of concurrent queries?

Do you always filter on project name? If so, how many different project names are there in the index?

Hi @Christian_Dahlqvist, thank you for your response.
Please find the requested information below.
Elasticsearch version

  • Elasticsearch 8.14

Cluster specs

  • Data nodes: 8 vCPUs, 64 GiB RAM, SSD 2 TB

  • Query/coord nodes: 8 vCPUs, 64 GiB RAM

Observed latency / timeouts

  • Many searches time out at ~60 seconds

  • Timeouts are more likely when we query at group level (a group contains many projects)

  • Queries sometimes work for smaller repos, but most fail for larger scopes

  • We see timeouts even with relatively small result size (top 25)

  • This can happen even with a single query, but is worse under higher request volume (concurrent traffic)

Filters / selectivity

  • We usually apply group-level and/or project-level filters

  • There are many projects, but limited groups

Sample request (slightly modified, not the actual request):


{
  "searchFilters": {
    "projectName": [ "SampleProject" ],
    "projectIdentifier": [ "sample-project-id" ]
  },
  "options": [ "Faceting", "Highlighting" ],
  "skipResults": 0,
  "takeResults": 50,
  "orderBy": [
    { "Field": "eventDate", "SortOrder": "Desc" }
  ],
  "fields": [
    "projectIdentifier",
    "projectName",
    "identifier",
    "recordType",
    "itemCategory",
    "eventId",
    "eventTitle",
    "eventDescription",
    "eventDate",
    "authorName",
    "authorEmail",
    "authorDate",
    "performerName",
    "performerEmail",
    "indexedTimestamp",
    "actionDate",
    "@timestamp"
  ],
  "highlightFields": [
    "eventTitle",
    "eventDescription",
    "authorName",
    "performerName"
  ],
  "terminateAfter": 0,
  "keepAlive": "1m",
  "pitId": null,
  "continueOnEmptyQuery": false,
  "searchAfter": null,
  "scopeFiltersExpression": {
    "type": "And",
    "children": [
      {
        "type": "Term",
        "field": "recordType",
        "operator": "Equals",
        "value": "sample-record-type"
      }
    ]
  },
  "queryParseTree": {
    "type": "Term",
    "field": "eventTitle",
    "operator": "Contains",
    "value": "test"
  }
}
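
The request above carries both pitId and searchAfter, so for reference here is a minimal sketch of how PIT plus search_after paging is usually expressed in the Elasticsearch search body. The mapping of the application-level request to this body is an assumption (for example, treating "Contains" as a match query), and both helper names are hypothetical.

```python
def first_page_body(pit_id, take=50):
    """First page: sorted by eventDate descending, no search_after yet."""
    return {
        "pit": {"id": pit_id, "keep_alive": "1m"},
        "size": take,
        # PIT searches add an implicit _shard_doc tiebreaker to the sort.
        "sort": [{"eventDate": {"order": "desc"}}],
        "query": {"match": {"eventTitle": "test"}},  # assumed translation of "Contains"
    }


def next_page_body(previous_body, response):
    """Build the next page from the last hit's sort values, or None when no hits remain."""
    hits = response["hits"]["hits"]
    if not hits:
        return None
    body = dict(previous_body)  # shallow copy; the previous body is left untouched
    # Each response may return an updated PIT id; use it for the next request.
    body["pit"] = {
        "id": response.get("pit_id", previous_body["pit"]["id"]),
        "keep_alive": "1m",
    }
    body["search_after"] = hits[-1]["sort"]
    return body
```

Unlike from/size, each page then starts from the previous page's last sort values rather than collecting and discarding the skipped hits on every shard.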

Index mapping

Below is a partial mapping snippet (data stream enabled). Key points:

  • @timestamp, authorDate, eventDate are date (epoch_second)

  • Many searchable fields are text with analyzers

  • Some fields also have keyword subfields (raw) with eager_global_ordinals=true

{
  ".ds-datastream_XX_8140-bc49432d81a7-2026.01.20-000001": {
    "mappings": {
      "_meta": {
        "version": 8140
      },
      "_data_stream_timestamp": {
        "enabled": true
      },
      "properties": {
        "@timestamp": {
          "type": "date",
          "format": "epoch_second"
        },
        "authorDate": {
          "type": "date",
          "format": "epoch_second"
        },
        "authorEmail": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword",
              "eager_global_ordinals": true
            }
          },
          "index_options": "offsets",
          "analyzer": "unstemmedFullTextAnalyzer"
        },
        "authorName": {
          "type": "text",
          "fields": {
            "pattern": {
              "type": "text",
              "index_options": "offsets",
              "analyzer": "contentAnalyzer"
            },
            "raw": {
              "type": "keyword",
              "eager_global_ordinals": true
            }
          },
          "norms": false,
          "analyzer": "LowerCaseAnalyzer"
        },
        "entityId": {
          "type": "text"
        },
        "entityName": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          },
          "norms": false,
          "analyzer": "LowerCaseAnalyzer"
        },
        "entityNameOriginal": {
          "type": "text"
        },
        "eventDate": {
          "type": "date",
          "format": "epoch_second"
        },
        "eventDescription": {
          "type": "text",
          "index_options": "offsets",
          "norms": false,
          "analyzer": "contentAnalyzer"
        },
        "eventId": {
          "type": "text"
        },
        "eventTitle": {
          "type": "text",
          "index_options": "offsets",
          "norms": false,
          "analyzer": "contentAnalyzer"
        },
        "performerEmail": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword",
              "eager_global_ordinals": true
            }
          },
          "index_options": "offsets",
          "analyzer": "unstemmedFullTextAnalyzer"
        },
        "performerName": {
          "type": "text",
          "fields": {
            "pattern": {
              "type": "text",
              "index_options": "offsets",
              "analyzer": "contentAnalyzer"
            },
            "raw": {
              "type": "keyword",
              "eager_global_ordinals": true
            }
          },
          "index_options": "offsets",
          "analyzer": "unstemmedFullTextAnalyzer"
        },
        "recordType": {
          "type": "text"
        },
        "identifier": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "indexedTimestamp": {
          "type": "date",
          "format": "epoch_second"
        },
        "itemCategory": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "projectIdentifier": {
          "type": "text"
        },
        "projectName": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword",
              "eager_global_ordinals": true
            }
          },
          "norms": false,
          "analyzer": "LowerCaseAnalyzer"
        },
        "projectNameOriginal": {
          "type": "text"
        },
        "initiator": {
          "type": "text",
          "fields": {
            "pattern": {
              "type": "text",
              "index_options": "offsets",
              "analyzer": "contentAnalyzer"
            },
            "raw": {
              "type": "keyword",
              "eager_global_ordinals": true
            }
          },
          "index_options": "offsets",
          "analyzer": "unstemmedFullTextAnalyzer"
        },
        "actionDate": {
          "type": "date",
          "format": "epoch_second"
        }
      }
    }
  }
}
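
One detail that stands out in the snippet: recordType and projectIdentifier, which the sample request filters on, are mapped as text only, with no keyword sub-field. Term-level queries against text fields match analyzed tokens rather than the original value, and text fields cannot back a terms aggregation unless fielddata is enabled. A small helper like the following (a sketch, with a hypothetical function name) lists such fields from the "properties" section of a mapping:

```python
def text_only_fields(properties):
    """Return the names of fields mapped as text with no keyword sub-field."""
    result = []
    for name, spec in properties.items():
        if spec.get("type") != "text":
            continue
        subfields = spec.get("fields", {})
        if not any(sub.get("type") == "keyword" for sub in subfields.values()):
            result.append(name)
    return sorted(result)


# Excerpt of the mapping posted above.
sample = {
    "recordType": {"type": "text"},
    "projectIdentifier": {"type": "text"},
    "projectName": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
    "eventDate": {"type": "date", "format": "epoch_second"},
}
print(text_only_fields(sample))
# prints: ['projectIdentifier', 'recordType']
```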

Have you tried profiling a few queries that are slow?

If so, what was the result?

Can you show a sample query that is slow exactly as it is sent to Elasticsearch?
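
In case it helps with collecting that: profiling is switched on by adding "profile": true to the search body, and the response then contains a per-shard breakdown. Below is a minimal sketch for pulling the slowest top-level query and aggregation components out of such a response. The helper names are hypothetical, and it only reads the time_in_nanos and type fields of the profile section, ignoring nested children.

```python
def with_profile(body):
    """Return a copy of a search body with profiling enabled."""
    profiled = dict(body)
    profiled["profile"] = True
    return profiled


def slowest_components(response, top=5):
    """List the slowest top-level query/aggregation components across all shards.

    Each row is (time_in_nanos, shard_id, kind, type), slowest first.
    """
    rows = []
    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            for query in search.get("query", []):
                rows.append((query["time_in_nanos"], shard["id"], "query", query["type"]))
        for agg in shard.get("aggregations", []):
            rows.append((agg["time_in_nanos"], shard["id"], "aggregation", agg["type"]))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]
```

Comparing the query rows with the aggregation rows for one slow request should show whether the time goes into matching documents or into computing the facets.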