If I have indices like the ones below:
index-2022
index-2023
index-2024
index-2025
The data view is "index-*" (@timestamp is the date field).
If I run a query against index-* with @timestamp > now() - interval 30 day,
will it hit all the indices or just the most recent one?
Big topic @elasticforme
This is all based on a concept often called prefilter, but more accurately can-match logic.
There is a lot of subtlety here ...
Data Streams:
If these were data streams (which I suspect they are not), the indices/shards would get filtered in/out based on the filter on @timestamp during the can-match stage (which is specialized for data streams), before the actual search (query + fetch phases). In that case any non-matching shards would be skipped.
Regular Indices:
If these are regular indices, not data streams, that adds a bit more complexity, so take a look at this. There is a threshold that decides when the can-match phase is applied.
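To make the can-match idea concrete, here is a rough Python sketch. It is not Elasticsearch's actual implementation: the `SHARD_BOUNDS` table, the function names, and the fixed `now` are all made up for illustration. It only assumes that each shard knows the min/max @timestamp it holds, and that a shard whose max is older than the `gte` bound cannot match.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Hypothetical (min, max) @timestamp bounds per shard; values are illustrative only.
SHARD_BOUNDS = {
    "test-2023": (datetime(2023, 1, 1, tzinfo=UTC), datetime(2023, 12, 31, tzinfo=UTC)),
    "test-2024": (datetime(2024, 1, 1, tzinfo=UTC), datetime(2024, 12, 31, tzinfo=UTC)),
    "test-2025": (datetime(2025, 1, 1, tzinfo=UTC), datetime(2025, 7, 1, tzinfo=UTC)),
}


def can_match(bounds, gte):
    """A shard can match `@timestamp >= gte` only if its newest document is >= gte."""
    _min_ts, max_ts = bounds
    return max_ts >= gte


def split_shards(shard_bounds, gte):
    """Return (to_search, skipped) shard names for a `gte` range filter."""
    to_search = [name for name, b in shard_bounds.items() if can_match(b, gte)]
    skipped = [name for name in shard_bounds if name not in to_search]
    return to_search, skipped


# Fixed "now" so the example is repeatable; a real query would use the current time.
now = datetime(2025, 7, 9, tzinfo=UTC)
to_search, skipped = split_shards(SHARD_BOUNDS, now - timedelta(days=90))
print("search:", to_search)
print("skipped:", skipped)
```

With these made-up bounds, only test-2025 is left to search and the two older shards are skipped without running the query on them.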
Let's look at this test case.
Create 3 indices, one per year, where the data in each index falls within that year.
DELETE test-2025,test-2024,test-2023
POST test-2025/_doc
{
"@timestamp": "2025-05-03T17:22:52.592Z",
"message" : "its 2025"
}
POST test-2024/_doc
{
"@timestamp": "2024-04-03T17:22:52.592Z",
"message" : "its 2024"
}
POST test-2023/_doc
{
"@timestamp": "2023-03-03T17:22:52.592Z",
"message" : "its 2023"
}
A) If I search with just a normal _search, no shards are skipped
GET /test-2*/_search
{
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"format": "strict_date_optional_time",
"gte": "now-90d"
}
}
}
]
}
}
}
# Result No Shards Skipped
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0, <<< NO SHARDS SKIPPED :(
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0,
"hits": [
{
"_index": "test-2025",
"_id": "8r1Y8ZcB_y8EQ5ex3ukb",
"_score": 0,
"_source": {
"@timestamp": "2025-05-03T17:22:52.592Z",
"message": "its 2025"
}
}
]
}
}
BUT
B) If I do an _async_search, which is what Discover does, the shards are skipped
POST /test-2*/_async_search
{
"query": {
"bool": {
"filter": [
{
"range": {
"@timestamp": {
"format": "strict_date_optional_time",
"gte": "now-90d"
}
}
}
]
}
}
}
# Result skipped shards
{
"is_partial": false,
"is_running": false,
"start_time_in_millis": 1752101084860,
"expiration_time_in_millis": 1752533084860,
"completion_time_in_millis": 1752101084862,
"response": {
"took": 2,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 2, <<< SKIPPED SHARDS YAY!!
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0,
"hits": [
{
"_index": "test-2025",
"_id": "8r1Y8ZcB_y8EQ5ex3ukb",
"_score": 0,
"_source": {
"@timestamp": "2025-05-03T17:22:52.592Z",
"message": "its 2025"
}
}
]
}
}
}
So why is this?
Whether or not the can-match phase runs is determined by the pre_filter_shard_size setting (see "Run a search" in the Elasticsearch API documentation), which defaults to 128.
For async search, the default is 1. The async search documentation says:
pre_filter_shard_size defaults to 1 and cannot be changed: this is to enforce the execution of a pre-filter roundtrip to retrieve statistics from each shard so that the ones that surely don’t hold any document matching the query get skipped.
So your query in Discover will use _async_search, so it will take advantage of can-match, but if you just run a query in Dev Tools it will not, unless it looks like it will hit 128+ shards ... or unless you set it yourself:
GET /test-2*/_search?pre_filter_shard_size=1 <<< HERE
{
"query": {
....
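The threshold rule above can be written down as a tiny Python sketch. This only models the shard-count check discussed in this post, assuming the "number of shards exceeds the threshold" wording from the search API docs. Elasticsearch may run the phase in other situations too, and the function and constant names here are made up.

```python
DEFAULT_SEARCH_THRESHOLD = 128  # _search default, as described above
ASYNC_SEARCH_THRESHOLD = 1      # _async_search default, per the quoted docs


def runs_can_match(num_shards, pre_filter_shard_size=DEFAULT_SEARCH_THRESHOLD):
    """True if the shard count alone would trigger the pre-filter (can-match) roundtrip.

    Assumption: the phase is triggered when the shard count exceeds the threshold.
    """
    return num_shards > pre_filter_shard_size


# The 3-shard test case from this post:
print(runs_can_match(3))                          # plain _search with the default
print(runs_can_match(3, ASYNC_SEARCH_THRESHOLD))  # _async_search
print(runs_can_match(3, pre_filter_shard_size=1)) # _search?pre_filter_shard_size=1
```

Under that assumption, the plain _search over 3 shards does not pre-filter, while the async search and the explicit pre_filter_shard_size=1 request both do, which lines up with the skipped counts in the two results above.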
Hope this helps
Perfect.
Yes, in my case the indices are not data streams, and hence I was seeing index-2018 all the way to index-2025 in GET /_tasks.
But each index-yyyy only holds data for that year, and the client is using Python to search the data. They were using a lazy search with nothing else, just "search job=xyz".
Looks like I have to test on the Python side how to use async search and/or pass a single index for the search, i.e. search only index-currentYYYY rather than index-*.
Thanks for this great explanation.
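For the Python side, a minimal sketch of the "pass only the indices you need" option could look like the following. The helper names `yearly_indices` and `search_kwargs` are hypothetical, and the code assumes the index-YYYY naming from this thread. It only builds the arguments and does not talk to a cluster. Passing them to the Python client's search call assumes a client version that accepts `index`, `query`, and `pre_filter_shard_size` as keyword arguments.

```python
from datetime import datetime, timedelta, timezone


def yearly_indices(prefix, days_back, now=None):
    """Names of the yearly indices that overlap [now - days_back, now]."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days_back)
    return [f"{prefix}{year}" for year in range(start.year, now.year + 1)]


def search_kwargs(prefix, days_back, now=None):
    """Keyword arguments for a search call (assumed shape, see the note above)."""
    return {
        # Target only the yearly indices that can hold data in the window.
        "index": ",".join(yearly_indices(prefix, days_back, now)),
        # Also ask for the can-match roundtrip, as in the Dev Tools example.
        "pre_filter_shard_size": 1,
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{days_back}d"}}}
                ]
            }
        },
    }


# A window that crosses a year boundary needs two indices.
early_january = datetime(2025, 1, 10, tzinfo=timezone.utc)
print(yearly_indices("index-", 30, now=early_january))
print(search_kwargs("index-", 30, now=early_january)["index"])
```

Keeping the range filter in the query matters either way: narrowing the index list only limits which shards are contacted, and the filter still decides which documents match.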