Getting max value of a long field without losing precision

I want to get the maximum used ID from a field that stores incremental long IDs, but the max aggregation returns its result as a double.

This has been brought up before...

The suggestion to cast back to long won't be reliable because I can't easily tell if I have lost precision.

If the API would instead return the max as a long in the first place, that would be ideal.

If you absolutely have to support floating point values as well as longs... maybe BigDecimal is a better option?

Yeah, this is unfortunately a hard limitation on the agg framework right now. All aggs convert the values to a double before operating on them. I'm not sure if that'll change, at least easily... it's a pretty fundamental part of how the agg framework operates.

Before anything else, I think it should be noted that losing precision usually isn't a problem for IDs and the like. 64-bit doubles have 52-bit mantissas, so the largest integer you can store exactly is 2^53. See this SO answer for a breakdown of why. That gives you values from 0 up to roughly 9 quadrillion that can be stored without loss of precision. For most folks that's fine, but if you expect to have a max > 2^53 then it's a concern.
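
If you want to see that boundary concretely, here's a quick standalone Java check (nothing Elasticsearch-specific, just plain doubles): every integer up to 2^53 survives a round-trip through double, and 2^53 + 1 is the first one that doesn't.

    public class DoublePrecisionBoundary {
        public static void main(String[] args) {
            long limit = 1L << 53; // 9007199254740992

            // Every integer with magnitude <= 2^53 survives a double round-trip.
            System.out.println((long) (double) limit == limit);          // true

            // 2^53 + 1 is the first integer that doesn't: it rounds back down to 2^53.
            System.out.println((long) (double) (limit + 1) == limit + 1); // false
            System.out.println((long) (double) (limit + 1));              // 9007199254740992
        }
    }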

As a workaround, you could do a search with size: 1 sorted on the field descending, which will give you the document with the highest value, and you can extract it from the _source. Not a great solution, since it only really applies to min/max and not the other aggs, but it may help.
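
Roughly, that request would look something like the sketch below, here using the plain Java 11 HttpClient; the index name (my-index) and field name (my_id) are placeholders, so adjust them to your own mapping.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class MaxIdBySort {
        public static void main(String[] args) throws Exception {
            // Hypothetical index/field names -- substitute your own.
            String body = "{ \"size\": 1, \"sort\": [ { \"my_id\": \"desc\" } ], \"_source\": [\"my_id\"] }";

            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9200/my-index/_search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The top hit's _source.my_id is the value exactly as it was indexed,
            // since it never goes through the double-based agg code path.
            // Parse it out of the JSON response with whatever JSON library you use.
            System.out.println(response.body());
        }
    }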

We've stayed away from BigDecimal so far because it's just really, really slow. The overhead of using it is just a no-go for performant aggs. If/when we want support for Longs (or other non-floats) I think we'd probably just integrate it as a framework feature... defaulting to non-float when applicable (min/max/sum is fine, avg never, etc)


2^52 does give you something like 4*10^15 IDs.

So you'd expect 15 digits of precision, right?

But unfortunately, we see in one of those posts I linked that Elasticsearch is formatting its JSON like this:

      "value" : 5.61380157E8

So you're losing a fair chunk of the potential precision because of a poor choice of transport data format (JSON).

Scientific notation doesn't (necessarily) mean that you are losing precision.

5.61380157E8 == 5.61380157 * 10⁸ == 561380157.0

561380157.0 is less than 2⁵³ (9007199254740992), so assuming you believe the SO answer that all integers < 2⁵³ can be represented exactly, the number provided is 100% precise without any floating point errors.
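
You can verify that yourself in a couple of lines of Java; the string literal is just the value from the response above.

    public class ParseCheck {
        public static void main(String[] args) {
            double v = Double.parseDouble("5.61380157E8");
            System.out.println(v == 561380157.0); // true -- the value round-trips exactly
            System.out.println((long) v);         // prints 561380157
        }
    }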

I do realise that particular value doesn't lose precision, but I worry that if it's already breaking out scientific notation for a mere 9 digits, what will happen with 10, or 15... :confused:

You'd think that simply not formatting it like that would raise fewer questions. "Principle of least surprise", and all that.

No arguments there. The blame falls to Oracle though, not JSON per se. When Java prints a double as a string, it switches to scientific notation above a certain threshold; from the Double.toString javadoc:

If m is less than 10^-3 or greater than or equal to 10^7, then it is represented in so-called "computerized scientific notation."
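
You can see that threshold in action with a couple of println calls (this is plain Double.toString behaviour, nothing Elasticsearch-specific):

    public class ToStringThreshold {
        public static void main(String[] args) {
            System.out.println(Double.toString(9_999_999.0));   // "9999999.0"    -- below 10^7: plain notation
            System.out.println(Double.toString(10_000_000.0));  // "1.0E7"        -- at 10^7: scientific notation
            System.out.println(Double.toString(561_380_157.0)); // "5.61380157E8" -- the value from the thread
        }
    }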

You should be safe with a simple check: any value < 2⁵³ is the actual, non-rounded value, and anything greater is potentially problematic. Not ideal, I agree, but it is what it is.
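
Something like this sketch (not an official API, just a helper you'd write yourself) is probably enough to catch the problematic cases:

    public final class AggValues {
        private static final double MAX_EXACT = 0x1p53; // 2^53 = 9007199254740992

        /** Converts an agg result back to a long, refusing values that may have been rounded. */
        public static long toExactLong(double aggValue) {
            if (Math.abs(aggValue) >= MAX_EXACT) {
                throw new ArithmeticException("value " + aggValue + " may have lost precision");
            }
            return (long) aggValue;
        }
    }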

FWIW, there is an older ticket about this exact issue: Aggregations for long and date values · Issue #9545 · elastic/elasticsearch · GitHub

The resolution was that it wasn't possible to fix cleanly without breaking compatibility for the java client. But now that the Java REST client is nearly done, and the transport client will be going away, we may be able to revisit this.

Hi all, I am actually trying to search for a keyword with a range query between particular dates. Can anyone help me?

    curl -XGET 'http://localhost:9200/201706/_search' -H 'Content-Type: application/json' -d '{ "query": { "bool": { "must": [ { "match": { "message": "isis" } }, { "range": { "created_at": { "gte": "2017-06-01", "lte": "2017-06-30", "format": "yyyy-MM-dd" } } } ] } } }'
