I need to query multiple indices, and I was evaluating alternatives to listing all of the indices in the URL.
Aside from aliases, it seems I could use the meta-field _index to list the indices in the payload of my request. However, I was wondering whether that approach has any impact on which shards are searched.
In other words, if I specify _index equal to 'index1' and 'index2', and my ES cluster contains 40 indices with one shard each, is a query specifying:
"indices" : ["index1", "index2"],
going to target only the two shards of those two indices, or all 40 shards in the cluster, excluding afterwards those that don't meet the criteria?
Good question! Yes, it will have an impact. When you do this:
GET /index1,index2,index3/_search
{
  "query": { ... }
}
The coordinating node will identify the list of shards that need to be queried based on the indices listed in the URL. The query is then dispatched only to the nodes holding those shards, so it executes only where it is needed.
If instead you do something like this, adding a query on _index:
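A minimal sketch of such a request, assuming a terms query on _index and no index names in the URL (the exact query shape is an assumption, not necessarily the one originally posted):

```console
# Assumed form: no indices in the URL, so the search targets every index;
# the terms query on _index is what narrows the results to index1 and index2.
GET /_search
{
  "query": {
    "terms": {
      "_index": ["index1", "index2"]
    }
  }
}
```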
What happens is that the coordinating node is forced to dispatch the request to all shards of every index the request targets (every index in the cluster, if none are named in the URL). Each shard will then parse and execute the query locally. On shards whose index does not match, the _index query will be rewritten to a MatchNoDocs query and essentially act as a no-op. On matching shards it will execute as normal.
So it does add overhead: mostly network traffic, but also a bit of query rewriting. The _index field is mainly there for when you need different parts of your query to apply to different indices. If the query should be executed identically on all indices, just add them to the URL and you'll avoid some unnecessary network traffic.
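As an illustration of the case where _index is useful, here is a rough sketch of a query whose clauses differ per index. The field names title and body and the search text are hypothetical, chosen only for the example. The indices are still listed in the URL, so only their shards are contacted, and _index is used purely to route each clause:

```console
# Hypothetical fields: "title" is matched only in index1, "body" only in index2.
GET /index1,index2/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "filter": { "term": { "_index": "index1" } },
            "must": { "match": { "title": "elasticsearch" } }
          }
        },
        {
          "bool": {
            "filter": { "term": { "_index": "index2" } },
            "must": { "match": { "body": "elasticsearch" } }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
```

Each inner bool pairs an _index filter with the clause meant for that index, so a document is returned only if it satisfies the clause written for its own index.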