We have employed a hot-warm architecture in our cluster, and the indices share the same prefix. Right now, to keep our process sane, we do the following with our indices (a rough sketch of these steps follows the list):
1. Reindex the original index, name-YYYY.MM, into a new index called name--YYYY.MM
2. Shrink the new index
3. Force merge the new index
4. When done, move the name--YYYY.MM index onto the warm nodes.
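For what it's worth, here is a minimal sketch of those four steps using Python's requests library against the REST API. The month, the node name, the box_type attribute value, and the intermediate index name are all illustrative assumptions; in particular, shrink needs a distinct source and target index, so the sketch reindexes into a temporary name first.

```python
import requests

ES = "http://localhost:9200"
src, dst = "name-2019.04", "name--2019.04"  # hypothetical month
tmp = f"{dst}-tmp"                          # hypothetical shrink source

# 1. Reindex the original index into a temporary index.
requests.post(f"{ES}/_reindex",
              json={"source": {"index": src},
                    "dest": {"index": tmp}}).raise_for_status()

# 2. Shrink. The source must be write-blocked and fully allocated
#    to a single node before _shrink will accept it.
requests.put(f"{ES}/{tmp}/_settings", json={
    "index.blocks.write": True,
    "index.routing.allocation.require._name": "hot-node-1",  # assumed node name
}).raise_for_status()
requests.post(f"{ES}/{tmp}/_shrink/{dst}", json={
    "settings": {
        "index.number_of_shards": 1,
        "index.routing.allocation.require._name": None,  # clear the node pin
        "index.blocks.write": None,                      # allow writes again
    },
}).raise_for_status()

# 3. Force merge the shrunken index down to one segment.
requests.post(f"{ES}/{dst}/_forcemerge?max_num_segments=1").raise_for_status()

# 4. Move it to the warm tier by requiring the warm node attribute
#    (box_type is a common convention, not a built-in attribute).
requests.put(f"{ES}/{dst}/_settings", json={
    "index.routing.allocation.require.box_type": "warm",
}).raise_for_status()
```

Once the shrink completes, the temporary index can be deleted.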
We did it this way so that we can still very easily search everything, including the warm nodes, by selecting index='name-*'. However, sometimes we would like to search the same data on the hot nodes only.
Is there a way to pass the node type as a parameter to the search? Right now we have partially worked around this by using index='name-YYYY*', but is there a better way, one where I can specify node attributes when performing a search?
(I obviously realize that I could’ve named the new index differently but when we made this naming convention originally, we didn’t consider this particular use case)
Does your data include a timestamp field? If so, I would recommend simply searching everything and using a range filter on the timestamp field. Elasticsearch will automatically (and efficiently) skip any shards that don't contain data matching a timestamp range.
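For example, a minimal sketch of that approach (assuming a field called @timestamp and the name-* pattern from above):

```python
import requests

# Search every matching index; the pre-filter phase lets the coordinating
# node skip shards whose @timestamp values cannot match the range.
resp = requests.get("http://localhost:9200/name-*/_search", json={
    "query": {"bool": {"filter": [
        {"range": {"@timestamp": {"gte": "now-7d", "lte": "now"}}}
    ]}},
})
resp.raise_for_status()
print(resp.json()["_shards"])  # the "skipped" count shows pre-filtered shards
```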
The search still has to "hit" one copy of every shard -- it needs to check the timestamp range to know whether that shard can be skipped or not. But that's all it does if the shard doesn't match, so it's very quick.
IIRC this behaviour is reflected in the profile API output - you shouldn't see profiler entries for the skipped shards since there's basically nothing to profile.
It may be worth mentioning that the shards might not be pre-filtered at all. Based on the documentation (and I've run into this first hand), pre-filtering only happens when one of a few conditions is met, one of which is that the number of shards the search targets is ≥ 128 [docs].
@smlbiobot You might try adding pre_filter_shard_size=1 to your URL parameters for your search in addition to the time range filter suggested by @DavidTurner.
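For instance (a sketch again, reusing the assumed @timestamp field; whether it helps depends on the conditions above):

```python
import requests

# The same filtered search, with pre_filter_shard_size=1 so the shard
# pre-filter round-trip runs even when the search targets only a few shards.
resp = requests.get(
    "http://localhost:9200/name-*/_search",
    params={"pre_filter_shard_size": 1},
    json={"query": {"bool": {"filter": [
        {"range": {"@timestamp": {"gte": "now-30d"}}}
    ]}}},
)
```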
It's also very possible that I've misunderstood the conditions under which pre-filtering applies, and if so I'd love if someone could correct my understanding!
Yes that's true, although the general advice remains: just search everything with an appropriate filter and let Elasticsearch decide how best to execute the search. If you're not searching very many shards then pre-filtering doesn't happen because it's just as fast to query them all. You can try pre_filter_shard_size=1 if you like but it may not make any measurable difference or may even make it slightly slower.
@DavidTurner Thanks for the response. For my own understanding, would that depend on the complexity of the query and size of the shards? Is there a baseline recommendation in terms of pre_filter overhead? Thanks!
Not really, no; it only helps if you're going to hit a lot of shards that don't match due to a timestamp range filter. The recommendation is not to set this parameter, so that the default logic applies.
Our specific use case for this is about 30 indices so far (they're monthly, so I expect this to grow) on the warm nodes, each set to 1 shard and 1 replica (shrunk from 5 shards and 1 replica on the hot nodes).
We are also planning to use this strategy for other indices on the cluster, so thank you both for the conversation; it has been helpful.