Range vs Terms / Date vs Integer what is the speediest?


(Xavier Facq) #1

Hi guys,

I'm trying to super-optimize somes queries, but when trying to change a RangeQuery on a Date field,
I'm very surprised to do not have a better time using an Integer with TermsQuery.

I have one field: a Date field, lets say "2017-06-01"
I add another field: an Integer field storing only year + month as int : 201706

What we need is to have results having the same year for exemple (sometimes 2 months around a date):

Solution 1°/ (my initial query)

date field + range query like :

      .from("2017-06-01||/y")
      .to("2017-06-01||+1y/y")
      .includeUpper(false);

Solution 2°/

integer field + range query like :

      .from(201701)
      .to(201712)
      .includeUpper(true);

Solution 3°/

integer field + terms query like :

      termsQuery("fieldname", [201701, 201702, 201703, 201704 ... 201712])

None of the three is better than other, very surprised !!!

Any feedback or advice ?

bye,
Xavier


(Zachary Tong) #2

Yep, that's not unsurprising. Under the covers, dates are stored as simple numbers, so the comparison between number vs date sounds entirely reasonable.

The results between Range and Term is also expected, but the reasoning is version-dependent.

If you're on a 5.x version, numerics switched over to a new underlying datastructure called the BKD tree. This structure allows very efficient range lookup, and you can think of Term/Terms as ranges which only encompass a single value. So Ranges vs Term in 5.x are essentially the same underlying mechanism and you won't see any performance difference.

In older versions of Elasticsearch, you can potentially see this performance similarity too. Older Lucene versions had an optimization where Range queries are rewritten into a set of Term queries if the segment being queried only had a handful of matching terms. E.g. the Range may be [10...1000], but the segment only has numbers [11,12]. In this case it is faster to execute a Term for 11 & 12 instead of a full Range query. So it's essentially doing the optimization you were attempting to benchmark.

In cases where many numbers match on a segment, the setup cost of a full Range is outweighed by the iteration cost of many Terms, so a NumericRange is executed instead.

Hope that helps! Basically, Lucene has your back in regards to optimizing the best execution plan for numerics, you don't really need to worry about it :slight_smile:


(Xavier Facq) #3

Hi,

Thank you @polyfractal for all this explanations, it's very clear now ! I will keep the original solution, easy to read and understand. I was thinking that the date "convertion" had a cost ("2017-06-01||+1y/y") but it seems not, so go ahead with it !

Elasticsearch will never stop to surprise me !

bye,
Xavier


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.