Historically, date range searches have been difficult to do right
(performance- and memory-wise) on top of Lucene. The main reason in the
past was that a range search was translated into a logical OR of
all the terms that fell into the range. If the range is large enough
(e.g. a long time span) you get an OR of many, many terms, with the
expected impact on runtime. Sometimes you even exceed the maximum number
of OR clauses.
A common (and good) solution to the date range problem in the past was
to rely on decomposition of the date at indexing time. You would
basically index a date_time into six fields (year, month, ..., second)
and create an appropriate logical AND clause at query time. The main
advantage of this approach is that it limits the number of terms, as
opposed to the otherwise near-infinite set of millisecond values. The
second advantage is that you can be very flexible about precision
matching: to get all documents of a month you do not even need a range,
you just match on the year and month fields. Is there any support for
this type of decomposition within the ES mapping?
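For concreteness, here is a sketch of the decomposition done by hand
(field names and values are only illustrative). A document would carry
the decomposed fields next to the full date:

    { "created": "2009-11-15T14:12:12", "year": 2009, "month": 11,
      "day": 15, "hour": 14, "minute": 12, "second": 12 }

and "all documents of November 2009" becomes a logical AND of exact
matches, e.g. as an ES bool query:

    { "bool": { "must": [
        { "term": { "year": 2009 } },
        { "term": { "month": 11 } }
    ] } }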
I know that range searches have improved at the Lucene level via the
TrieField concept for numeric fields. Is this used by ES, and does it
completely solve the performance issue? What about index size when
storing lots of timestamps with second precision (think of the
number of terms in the index)?
First of all, as you noted, a lot has changed in Lucene since then.
elasticsearch uses the new trie-based numeric types (a numeric type in ES
translates to a numeric type in Lucene), and doing range queries on them
is much more efficient.
Still, this can be enhanced. One way to enhance it, and what is recommended
to elasticsearch users, is to use range filters. Range filters are
automatically cached, which means that a range filter for the year 2009
will be executed once, and then the result of the filter will be cached and
reused against any other query, giving much better results. The same can be
applied to months.
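As a sketch (the index and field names are assumptions on my side, and
the year boundaries are simplified), a filtered query with such a cached
range filter for the year 2009 would look like:

    {
      "filtered": {
        "query": { "match_all": {} },
        "filter": {
          "range": {
            "created": { "from": "2009-01-01", "to": "2009-12-31" }
          }
        }
      }
    }

The first query pays for building the filter, and any later query that
reuses the same filter gets it from the cache.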
Also, in 0.12 there is an option to use a numeric_range filter,
explained here: http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/numeric_range_filter/.
It will be much faster, at the expense of loading the dates into memory
(64 bits per value). If you are going to sort on the field or run facets
on it (histogram / range facets are great for date data), then you might
as well use it.
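A sketch of the same filter using numeric_range instead (field name
assumed, same simplified boundaries); the only change is the filter
type, the trade-off being the in-memory loading of the field values
mentioned above:

    {
      "filtered": {
        "query": { "match_all": {} },
        "filter": {
          "numeric_range": {
            "created": { "from": "2009-01-01", "to": "2009-12-31" }
          }
        }
      }
    }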
Thanks for the fast reply. The use of trie fields at the Lucene
level is good news. However, I was wondering whether support for date
decomposition as part of the ES mapping would still be a useful
feature request. It does not seem difficult to support given the
existing multi-field mappings. Using a year / month / day
decomposition would allow for great flexibility, because for many use
cases one would not even need range filters.
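To make the idea concrete, here is a minimal sketch of the kind of
mapping I have in mind, assuming the decomposed fields are still filled
in by the indexing client today (field names are made up, and there is
currently no mapping option that derives them automatically; that
derivation is what the feature request would add):

    {
      "doc": {
        "properties": {
          "created":       { "type": "date" },
          "created_year":  { "type": "integer" },
          "created_month": { "type": "integer" },
          "created_day":   { "type": "integer" }
        }
      }
    }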
Sure, it can be added, though I am not sure the use case is that
common anymore. For years, a cached range filter will do just as well:
you can create a range filter over the lexicographic order of the year
values, but the end result is the same (though, agreed, at the expense of
a slower initialization of the cache). For months and days, I am not sure.
Usually you want them within a specific year (or range of years), and that
is also a great candidate for a cached range filter.
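To illustrate the equivalence (field names again assumed): for
"everything in 2009", a term filter on a decomposed year field and a
cached range filter on the date field select the same documents:

    { "term": { "created_year": 2009 } }

    { "range": { "created": { "from": "2009-01-01", "to": "2009-12-31" } } }

The difference is mainly the cost of building the range filter the first
time, which the cache then amortizes.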
There are cases where it might make sense, I am not saying that it doesn't;
I am just trying to understand whether this is a case of pre-optimizing
something that is already nicely solved with the current features.
In any case, you can open a feature request for it; it should not be
difficult to implement.