More efficient date sorting

Matthew_Painter · November 22, 2013, 7:40pm

Hi all,

We have an index with ms precision dates stored as longs.

To sort on this, if I understand correctly, we need to load all of the
longs into memory in the field cache.

However, if we know that all of the dates are < now(), we could use custom
scoring with a decay function to more efficiently sort the result set.

Is this a good idea - or crazy?

Matt

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · November 22, 2013, 9:30pm

Custom scoring is expensive.

If you can restrict your sorting domain to int range, you do not even need
to encode dates as longs, just use ints instead (or even bytes).

Jörg

Matthew_Painter · November 22, 2013, 10:37pm

Fair point. In this case, would it be possible to map the longs to ints as
a multi-field value at index time? E.g. ms since epoch => minutes since X.

It strikes me that having control over sort granularity in ES would be a cool
feature.
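
A rough sketch of the "ms since epoch => minutes since X" mapping, in Python. The base instant X (2010-01-01 UTC here), the constant names, and the function name are assumptions chosen for illustration; nothing below is an Elasticsearch API, and the conversion would have to run in the indexing client.

```python
from datetime import datetime, timezone

MS_PER_MINUTE = 60_000
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit int field can hold

# Hypothetical base instant "X": any fixed point earlier than the oldest document.
BASE_MS = int(datetime(2010, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


def ms_to_minutes_since_base(epoch_ms: int, base_ms: int = BASE_MS) -> int:
    """Map a millisecond timestamp to whole minutes elapsed since base_ms."""
    minutes = (epoch_ms - base_ms) // MS_PER_MINUTE
    if not 0 <= minutes <= INT32_MAX:
        raise ValueError("timestamp does not fit the int sort key range")
    return minutes


# Example: the coarse sort key for a document timestamped 2013-11-22 19:40 UTC.
ts = int(datetime(2013, 11, 22, 19, 40, tzinfo=timezone.utc).timestamp() * 1000)
sort_key = ms_to_minutes_since_base(ts)
print(sort_key)
```

The trade-off is that documents within the same minute share a sort key, so their relative order is no longer determined by the timestamp.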

Anant_Aneja · September 25, 2014, 8:05am

Did you run any experiments comparing sorting on dates vs. the custom scoring suggestion you made?

Matthew_Painter · September 25, 2014, 6:08pm

No. But I would suggest that mapping from a long to an int would
obviously be more performant.

ananth · September 26, 2014, 11:21am

Hi,

Initially we too used System.currentTimeMillis(). Then we switched to two
int fields, something like yyyyMMdd and HHmmssSSS.

If the query's time criteria fall within a single date, we don't apply the
yyyyMMdd field for sorting. We get decent performance compared with the
System.currentTimeMillis() long.
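
For readers following along, here is a minimal Python sketch of how a millisecond timestamp could be split into the two int keys mentioned above. It assumes UTC and a digit-packed encoding (HH, mm, ss, SSS concatenated as decimal digits); the function name is made up for this example and the actual encoding used by the poster is not shown in the thread.

```python
from datetime import datetime, timezone
from typing import Tuple


def split_timestamp(epoch_ms: int) -> Tuple[int, int]:
    """Return (yyyyMMdd, HHmmssSSS) as two ints for a UTC millisecond timestamp."""
    seconds, millis = divmod(epoch_ms, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # Pack the calendar date as decimal digits, e.g. 2014-09-26 -> 20140926.
    day_key = dt.year * 10_000 + dt.month * 100 + dt.day
    # Pack the time of day as decimal digits; the largest value is 235959999,
    # which still fits in a signed 32-bit int.
    time_key = (dt.hour * 10_000_000 + dt.minute * 100_000
                + dt.second * 1_000 + millis)
    return day_key, time_key


example_ms = int(datetime(2014, 9, 26, 11, 21, 0,
                          tzinfo=timezone.utc).timestamp() * 1000) + 123
print(split_timestamp(example_ms))
```

Both keys sort in the same order as the original timestamp when compared as the pair (yyyyMMdd, HHmmssSSS).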

Hi Jörg,

How much memory will ES take when applying an aggregation on a long field
which contains ~86 million unique long values (1000 * 60 * 60 * 24 =
86400000 unique millis in a day)?

If I understand correctly, it is the number of unique values * 8 bytes for
a long, i.e. 86400000 * 8 = 691200000 bytes (659 MB). In the case of
yyyyMMdd as an int field, 86400000 * 4 = 345600000 bytes (329 MB).
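
The arithmetic in this estimate can be checked with a few lines of Python. This only reproduces the poster's back-of-the-envelope formula (one fixed-width value per unique timestamp); it is not a measurement of how Elasticsearch or Lucene actually lay values out in memory, and the helper names are invented for the example.

```python
UNIQUE_MILLIS_PER_DAY = 1000 * 60 * 60 * 24  # 86,400,000 distinct millis in a day


def raw_size_bytes(value_count: int, bytes_per_value: int) -> int:
    """Naive estimate: every value stored at full width, no compression."""
    return value_count * bytes_per_value


def to_mib(size_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return size_bytes / (1024 * 1024)


long_bytes = raw_size_bytes(UNIQUE_MILLIS_PER_DAY, 8)  # 691,200,000 bytes
int_bytes = raw_size_bytes(UNIQUE_MILLIS_PER_DAY, 4)   # 345,600,000 bytes

print(f"long: {long_bytes} bytes = {to_mib(long_bytes):.1f} MiB")
print(f"int:  {int_bytes} bytes = {to_mib(int_bytes):.1f} MiB")
```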

What is the role of Lucene's packed ints in this case? Sorry if I am
missing something.

Also, we are using doc values with the default option.

jprante · September 26, 2014, 1:54pm

Your formula is not correct.

yyyyMMdd would map all values of a day to a single integer, and you get
something like "sort by day" or "filter by day".

Assuming you have an even distribution and a year of timestamps,
you can estimate: 80 million / 365 = 219,178 timestamps per day. In the "day
field", you have only 365 integers in the cache instead of 80 million longs for
unique millis. If "day" is too coarse, you can add an hour, minute, or second
index.
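
To illustrate the effect of granularity on the number of distinct values, here is a small Python sketch on synthetic data (one event every 10 minutes for a 365-day year, starting at the epoch). The data, the unit table, and the function names are assumptions made up for this example; only the idea of truncating to day, hour, minute, or second comes from the post.

```python
MS_PER_UNIT = {
    "day": 86_400_000,
    "hour": 3_600_000,
    "minute": 60_000,
    "second": 1_000,
}


def truncate(epoch_ms: int, unit: str) -> int:
    """Bucket a millisecond timestamp into whole units since the epoch."""
    return epoch_ms // MS_PER_UNIT[unit]


def distinct_keys(timestamps, unit: str) -> int:
    """Count how many different sort keys a granularity produces."""
    return len({truncate(ts, unit) for ts in timestamps})


# Synthetic year of data: one timestamp every 10 minutes.
YEAR_MS = 365 * MS_PER_UNIT["day"]
timestamps = range(0, YEAR_MS, 10 * MS_PER_UNIT["minute"])

for unit in ("day", "hour", "minute"):
    print(unit, distinct_keys(timestamps, unit))

# The per-day estimate from the post, rounded down.
print(80_000_000 // 365)
```

Coarser units collapse many timestamps onto the same key, which is where the saving comes from, at the cost of ties within each bucket.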

Jörg

ananth · September 26, 2014, 4:50pm

Hi, sorry, I intended to say the HHmmssSSS field. When I apply sorting or aggregations on the HHmmssSSS field, how much memory will it take? In this case the number of unique values for the HHmmssSSS field can be 86400000 (~86.4 million). FYI: we maintain daily indexes. When a user searches across days (for example, the last 7 days), I sort by both yyyyMMdd and HHmmssSSS. If a user searches a single day (for example, today), I sort by the HHmmssSSS field alone.

ananth · September 26, 2014, 5:16pm

Hi Jörg, sorry,
I intended to say the HHmmssSSS field. How much memory will ES take when I
apply sorting or aggregations on the HHmmssSSS field? In this case the number
of unique values for the HHmmssSSS field can be 86400000 (~86.4 million).
Note: we are creating daily indexes. If a user searches
across multiple dates (for example, the last 7 days), I sort by both
yyyyMMdd and HHmmssSSS. If a user searches a single day (for example, today),
I sort on the HHmmssSSS field alone.
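
A minimal sketch of that sort selection, written as a Python helper that builds the `sort` part of a search request body. The helper name and its parameters are hypothetical; the field names are the two int fields from this thread, and the real client code is not shown here.

```python
from datetime import date
from typing import Dict, List


def build_sort(first_day: date, last_day: date, order: str = "desc") -> List[Dict]:
    """Pick sort fields depending on whether the search spans more than one day."""
    time_sort = {"HHmmssSSS": {"order": order}}
    if first_day == last_day:
        # Single day: every hit shares the same yyyyMMdd, so skip that field.
        return [time_sort]
    # Multiple days: order by day first, then by time within the day.
    return [{"yyyyMMdd": {"order": order}}, time_sort]


print(build_sort(date(2014, 9, 26), date(2014, 9, 26)))
print(build_sort(date(2014, 9, 20), date(2014, 9, 26)))
```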

jprante · September 26, 2014, 6:35pm

Hi Anantha,

Yes, that sounds reasonable to me. In that case you have two longs, and if
you filter by day, you can save resources when sorting within the day.

Jörg

ananth · September 27, 2014, 2:34am

Hi Jörg, we are dealing
with logs. If a user debugs their code through logs, I need to sort on the
HHmmssSSS field. yyyyMMdd and HHmmssSSS are both int fields. How costly is
sorting on the HHmmssSSS field (86.4 million unique values)? I am curious
to know whether Lucene's packed ints play a role here, or is it simply the
number of unique values * 4 bytes?

ananth · September 29, 2014, 11:51am

Hi,

Sorry, sending posts from my phone caused these ugly replies.

Jörg, can you please look at this question when you find time?

jprante · September 29, 2014, 12:24pm

If you sort on a field with 86.4 million unique values, ES will load these
values into RAM and sort on them.

The "packed int" feature of Lucene is not important here; it is designed for
high-frequency terms.

Jörg

ananth · September 30, 2014, 4:07am

Hi Jörg,

Thanks for replying!
