I'm curious how Elasticsearch handles queries on time-based indices.
For example: if I have 5 years of time-based indices named eventdata-{dd-MM-yyyy} and I query them using eventdata-*, does that force the coordinating node to ask each and every node, even if the query contains a date range filter? And does each node keep some index metadata, such as per-field min/max statistics, that lets it skip indices that can't possibly match the criteria?
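To make that concrete, here is the kind of search I mean (just a sketch; the @timestamp field name and the dates are my own examples):

```
GET eventdata-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2018-12-01T00:00:00Z",
        "lt": "2018-12-02T00:00:00Z"
      }
    }
  }
}
```

The range covers a single day, but the eventdata-* pattern still expands to all five years of indices.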
In a hot/cold scenario, it's interesting to know how much I can spare the cold nodes from being queried while still using a wildcard index pattern like eventdata-*.
I think this blog post and the two other posts linked at the beginning of it should answer your questions. But if any questions remain, please don't hesitate to ask them here.
Thanks for the links - very informative, learned a ton!
I can see that there is definitely min/max range information present that a node can use for caching decisions, and to determine if an index is relevant for a range query or not.
However, I am still wondering whether the coordinating node has that information as well, so that it can completely ignore nodes that only contain indices which could never match the range in the query.
I'm asking because I'd like to get an idea of how significant the difference between a wildcard index match and index date math matching would be.
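By index date math matching I mean something like this (a sketch, using the dd-MM-yyyy pattern from my example; the date math expression has to be URL-encoded in the request path):

```
# Unencoded form: GET /<eventdata-{now/d{dd-MM-yyyy}}>/_search
# This resolves to today's index only, instead of expanding
# eventdata-* across all five years of indices.
GET /%3Ceventdata-%7Bnow%2Fd%7Bdd-MM-yyyy%7D%7D%3E/_search
```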
So, the coordinating node indeed knows the name of the index in some format, let's say blah-20181201, and it has access to the query, so it could theoretically analyze the query, figure out that it contains a filter on some field of type date, and we could teach it to extract the date from the index name. What is missing here, however, is the non-trivial connection between the name of the index and the field: the fact that the field we see in the query is the actual field that determines which index a document goes into, and the guarantee that the index blah-20181201 will only contain documents with values between 00:00:00 and 23:59:59 UTC (or is it EST, or maybe CET?) on Dec 1, 2018. And who can guarantee that?

Basically, such a mechanism would be useless without another mechanism that prevents "wrong" records from being added to the index. And that really complicates things and limits you to a single field per index.
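For example, nothing stops someone from indexing a document whose timestamp contradicts the index name (a hypothetical sketch; the @timestamp field name is just an example):

```
# This succeeds, so any pruning based purely on the index name
# would silently miss this document in a 2016 range query:
PUT blah-20181201/_doc/1
{
  "@timestamp": "2016-07-04T12:00:00Z"
}
```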
That makes perfect sense and thanks for taking the time to explain.
I was thinking about this line from the article you pointed me to, where it says:
... a new utility in Lucene that allows us to fetch the upper and lower bounds of a field from a top-level reader....
If the coordinating node had access to that for all nodes in the cluster state (I guess that's where it would go), it could provide exactly that guarantee and basically skip over nodes and indices that could never match the range query.
I'd guess, though, that it doesn't have access to that.
Indeed, the coordinating node can ask any data node for bounds, but how would the coordinating node know that the bounds haven't changed since it last checked with that data node? In other words, to make this bulletproof we would need to go to all data nodes and ask them for an update on every request, right?