Range vs Terms / Date vs Integer what is the speediest?

xavierfacq · June 1, 2017, 1:53pm

Hi guys,

I'm trying to super-optimize somes queries, but when trying to change a RangeQuery on a Date field,
I'm very surprised to do not have a better time using an Integer with TermsQuery.

I have one field: a Date field, lets say "2017-06-01"
I add another field: an Integer field storing only year + month as int : 201706

What we need is to have results having the same year for exemple (sometimes 2 months around a date):

Solution 1°/ (my initial query)

date field + range query like :

      .from("2017-06-01||/y")
      .to("2017-06-01||+1y/y")
      .includeUpper(false);

Solution 2°/

integer field + range query like :

      .from(201701)
      .to(201712)
      .includeUpper(true);

Solution 3°/

integer field + terms query like :

      termsQuery("fieldname", [201701, 201702, 201703, 201704 ... 201712])

None of the three is better than other, very surprised !!!

Any feedback or advice ?

bye,
Xavier

polyfractal · June 1, 2017, 6:30pm

Yep, that's not unsurprising. Under the covers, dates are stored as simple numbers, so the comparison between number vs date sounds entirely reasonable.

The results between Range and Term is also expected, but the reasoning is version-dependent.

If you're on a 5.x version, numerics switched over to a new underlying datastructure called the BKD tree. This structure allows very efficient range lookup, and you can think of Term/Terms as ranges which only encompass a single value. So Ranges vs Term in 5.x are essentially the same underlying mechanism and you won't see any performance difference.

In older versions of Elasticsearch, you can potentially see this performance similarity too. Older Lucene versions had an optimization where Range queries are rewritten into a set of Term queries if the segment being queried only had a handful of matching terms. E.g. the Range may be [10...1000], but the segment only has numbers [11,12]. In this case it is faster to execute a Term for 11 & 12 instead of a full Range query. So it's essentially doing the optimization you were attempting to benchmark.

In cases where many numbers match on a segment, the setup cost of a full Range is outweighed by the iteration cost of many Terms, so a NumericRange is executed instead.

Hope that helps! Basically, Lucene has your back in regards to optimizing the best execution plan for numerics, you don't really need to worry about it

xavierfacq · June 2, 2017, 7:05am

Hi,

Thank you @polyfractal for all this explanations, it's very clear now ! I will keep the original solution, easy to read and understand. I was thinking that the date "convertion" had a cost ("2017-06-01||+1y/y") but it seems not, so go ahead with it !

Elasticsearch will never stop to surprise me !

bye,
Xavier

system · June 30, 2017, 7:05am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Performance of Range queries on Keyword fields Elasticsearch	3	152	April 14, 2024
Efficient date range handling Elasticsearch	4	2685	July 6, 2017
Difference range query between numeric type and range type Elasticsearch	2	318	July 29, 2021
Why term query, keyword is faster than number Elasticsearch	2	641	March 19, 2021
Performance querying time-based indices in a date range Elasticsearch	3	2376	August 3, 2020

Range vs Terms / Date vs Integer what is the speediest?

Related topics