Terms aggregations on hashcodes (Murmur3FieldMapper)

Hello,

I have an index (4 shards, 1 replica each) with a high ingestion rate (5k per second), distributed among 5 data nodes.

The problem I see is that when a terms aggregation is run on this index, the response time is very high.

Mapping:
"recipients": { "type": "string", "index": "not_analyzed", "fields": { "hash": { "type": "murmur3" } } },

Aggregation query

"aggregations": { "bucket_agg": { "terms": { "field": "recipients.hash", "size": 5, "shard_size": 0 } } }

Aggregating on the hash code, as in the query above, is much faster than aggregating on the string field. That is expected, since hash code fields are not strings and do not need global ordinals to be updated in field data on a heavily indexed index.

My problem is that when I aggregate on the hash code, the value returned by the aggregation is a hash code that I cannot match to the hash code I generate from the original string. I used the Mapper code to generate the hash code. Can you please let me know whether Elasticsearch pads the hash code returned by the aggregation or applies some other optimization to it?

Thanks!

It looks like the hash code returned by the aggregation has its trailing digits replaced by 0s:

Returned by aggregation: 7532129326328174000
Returned by murmur3: 7532129326328173534

Why and how is the number rounded up?

I dunno the answer to this one - I'm not super familiar with aggregations. Rounding a hash code is genuinely weird.

You'll probably have more luck if you just ask the question rather than ping someone directly.

Np.

Apologies. I just added you because I saw you as one of the committers of Mapper code.

Thanks again.

This turned out to be a bug in the Elasticsearch Sense plugin, which rounded off the hash codes in the response.
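
For anyone who runs into the same thing, here is a minimal sketch of how this kind of rounding can happen. It assumes (this is my guess, not something confirmed in the thread) that the client parses JSON numbers as 64-bit floating point values, as JavaScript does. Python's float is the same IEEE 754 double, so it can stand in for that behaviour:

```python
# Hash value produced by murmur3, copied from the post above.
exact = 7532129326328173534

# Assumption: the client stores every JSON number as an IEEE 754 double.
as_double = float(exact)

# Doubles only represent integers exactly up to 2**53; this hash is far larger.
print(exact > 2**53)   # True

# Nearest value a double can hold (doubles are 1024 apart in this range).
print(int(as_double))  # 7532129326328173568

# Shortest decimal string that still identifies that double.
# Written without the exponent it is 7532129326328174000.
print(repr(as_double)) # 7.532129326328174e+18
```

The last value matches the number shown by the aggregation in Sense. Fetching the same response with a client that keeps 64-bit integers intact (for example curl, which just prints the raw response body) should show the unrounded hash.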