Murmur3 values differ between ES 1.7 and 6.4

We're in the process of finally migrating from ES 1.7, and I noticed a strange inconsistency in the hashed values of murmur3 fields between the two versions. Specifically, the hashes do not match in the last byte. This is particularly strange because I've diffed the MurmurHash3.java file in 1.7 and 6.4 and found only a minor formatting difference.

Furthermore, I have called the MurmurHash3 code directly in testing and found that it produces results that match the C++ reference implementation, but the stored value in ES 1.7 is slightly off. The value in ES 6.4 matches what is expected.

For example, given the string "db030357-7a16-41c0-b69a-02a12299f90f", the output of calling hash128 directly is -1884620459626981620, but the stored hashed value in 1.7 is -1884620459626981600 (note the 00 at the end instead of 20). The stored hashed value in 6.4 matches the expected value.
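
For anyone who wants to reproduce the hash without going through the Java class, here is a quick sketch using the mmh3 package from PyPI (that package isn't what I actually ran; I'm assuming its x64 128-bit variant with seed 0 matches what ES stores, i.e. the first 64 bits of the 128-bit hash over the UTF-8 bytes):

import mmh3

# Sketch only: mmh3's x64 128-bit variant, seed 0. As far as I understand, ES's
# murmur3 field hashes the UTF-8 bytes of the string and keeps the first 64 bits (h1).
value = "db030357-7a16-41c0-b69a-02a12299f90f"
h1, h2 = mmh3.hash64(value.encode("utf-8"), seed=0, signed=True)
print(h1)  # should print -1884620459626981620 if the variants line up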

Since we did not store _source on hashed fields in 1.7, this presents a bit of a problem. We were hoping to just convert the fields to long for 6.4 and do a simple reindex (with some massaging to extract the fielddata_fields and reformat them), but we need to be able to do the hashing in a consistent way in the future.
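
The reindex we had in mind is roughly the following (sketch only: the hosts and target index name are placeholders, and in practice the 1.7 and 6.4 clusters would likely need different client versions):

from elasticsearch import Elasticsearch, helpers

old_es = Elasticsearch("http://old-cluster:9200")   # 1.7 cluster (placeholder host)
new_es = Elasticsearch("http://new-cluster:9200")   # 6.4 cluster (placeholder host)

HASH_FIELDS = ["visitors", "vis_new", "vis_returning"]

query = {
    "query": {"match_all": {}},
    # 1.7: pull the already-stored hashes out of doc_values
    "fielddata_fields": HASH_FIELDS,
}

def actions():
    for hit in helpers.scan(old_es, index="v1_casterisk-5min-2018.10.26", query=query):
        doc = dict(hit.get("_source") or {})
        fields = hit.get("fields", {})
        for f in HASH_FIELDS:
            # keep the hashes as computed by 1.7, to be stored as plain longs in 6.4
            doc[f] = fields.get(f, [])
        yield {
            "_index": "v2_casterisk-5min-2018.10.26",  # placeholder target index
            "_type": "url",
            "_id": hit["_id"],
            "_source": doc,
        }

helpers.bulk(new_es, actions())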

If anyone knows what is actually changing the values slightly in ES 1.7, I can then replicate that difference in our code that writes new data to the new 6.4 indices. I'll keep digging around, but there is nothing obviously modifying the value where hash128 is called.

I should also note it's not off by a constant value of -20. I've seen differences of up to ±255, which is why I think it's some corruption of the final byte.

One other note: I have verified that the difference is consistent across multiple ES 1.7 clusters for the same inputs. That is, the value stored in 1.7 for the input string above is always -1884620459626981600.

How are you indexing the data? Are you sending the numbers as integers or strings? Is it possible that this is due to lack of precision in JavaScript for large integers?

When we index the docs, the values to be hashed are strings. We index them via the official Python client.

There is no JavaScript involved in our code, but even if there were, the hashes are less than 21 digits long.

Are they formatted as strings or integers in the JSON document you are sending to Elasticsearch? Are they mapped as long in both versions?

They are formatted as strings. Here's an example payload:

{
    "apikey": "blog.parsely.com",
    "ts": "2018-10-26T14:03:18.073748+00:00",
    "ts_apikey": "2018-10-26T14:03:18.073748",
    "index_ts": "2018-10-26T14:03:18.073792+00:00",
    "freq": "5min",
    "url": "https://blog.parsely.com/",
    "visitors": [
        "db030357-7a16-41c0-b69a-02a12299f90f"
    ],
    "vis_new": [
        "db030357-7a16-41c0-b69a-02a12299f90f"
    ],
    "vis_returning": []
}

visitors, vis_new, and vis_returning are all Murmur3 fields.
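
In case it matters, the indexing call itself is nothing special. A stripped-down version of what we do with the Python client looks something like this (host is a placeholder, and I've trimmed the payload down to the hashed fields):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://old-cluster:9200")  # placeholder host for the 1.7 cluster

doc = {
    "apikey": "blog.parsely.com",
    "url": "https://blog.parsely.com/",
    "visitors": ["db030357-7a16-41c0-b69a-02a12299f90f"],
    "vis_new": ["db030357-7a16-41c0-b69a-02a12299f90f"],
    "vis_returning": [],
}

es.index(
    index="v1_casterisk-5min-2018.10.26",
    doc_type="url",
    id="https://blog.parsely.com/|blog.parsely.com|2018.10.26.14:03:18.073748|None|1",
    routing="blog.parsely.com",
    body=doc,
)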

Once stored in ES 1.7, it looks like this:

{
    "_index": "v1_casterisk-5min-2018.10.26",
    "_type": "url",
    "_id": "https://blog.parsely.com/|blog.parsely.com|2018.10.26.14:03:18.073748|None|1",
    "_score": 1,
    "_source": {
        "index_ts": "2018-10-26T14:03:18.073792+00:00"
    },
    "fields": {
        "freq": [
            "5min"
        ],
        "_version": [
            1
        ],
        "_routing": "blog.parsely.com",
        "ts": [
            1540562598073
        ],
        "apikey": [
            "blog.parsely.com"
        ],
        "visitors": [
            -1884620459626981600
        ],
        "url": [
            "https://blog.parsely.com/"
        ],
        "ts_apikey": [
            1540562598073
        ],
        "vis_new": [
            -1884620459626981600
        ]
    }
}

Here's the relevant part of the mapping for 1.7:

               "visitors": {
                  "type": "murmur3",
                  "index": "no",
                  "doc_values": true,
                  "fielddata": {
                     "format": "doc_values"
                  },
                  "precision_step": 2147483647,
                  "null_value": -1
               }
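
And on the 6.4 side, the plan is to map those same fields as plain longs, roughly like this (again just a sketch; the host and index name are placeholders):

from elasticsearch import Elasticsearch

new_es = Elasticsearch("http://new-cluster:9200")  # placeholder host for the 6.4 cluster

new_es.indices.create(
    index="v2_casterisk-5min-2018.10.26",
    body={
        "mappings": {
            "url": {
                "properties": {
                    # hashes carried over from 1.7, stored as plain longs
                    "visitors": {"type": "long"},
                    "vis_new": {"type": "long"},
                    "vis_returning": {"type": "long"},
                }
            }
        }
    },
)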

I ran this example in the Kibana Console:

PUT /test/doc/1
{
  "idx": 1884620459626981620
}

GET /test/_mapping

PUT /test/doc/2
{
  "idx": "1884620459626981620"
}

GET test/_search
{
    "query": {
        "range" : {
            "idx" : {
                "gte" : 1884620459626981620,
                "lte" : 1884620459626981620
            }
        }
    }
}

and it returned the following response:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "idx": "1884620459626981620"
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "idx": 1884620459626981600
        }
      }
    ]
  }
}

I have, however, never used the murmur3 mapping you showed, so I'm not sure how that behaves.

I'm not quite sure what to make of that. Was that with 1.7 or 6.4?

Since I'm only seeing the incorrect hashes with 1.7, the implication seems to be that something internal to ES has a precision issue in 1.7 that was fixed in 6.4.

The way the murmur3 mapping works is that you send it a string, the string is hashed, and the hashed value is stored. The client never calculates or sends the hash value, so it must be an internal imprecision.

Also, your example seems to imply that one should always send numbers to ES as strings, which is a bit counterintuitive.

If I return the documents via curl I get the right value, so the issue in my example seems to be how the JavaScript in my browser decodes it. Do you get the same value if you fetch the document via Python and curl? Any difference between 1.7 and 6.4?
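
You can see the precision issue without Elasticsearch at all: any consumer that parses JSON numbers into 64-bit doubles (like the JavaScript in the browser) loses the low bits of an integer that large. A quick Python illustration:

exact = -1884620459626981620

# A 64-bit double only has a 53-bit mantissa, so integers this large get rounded
# to the nearest representable value.
via_double = int(float(exact))

print(exact)         # -1884620459626981620
print(via_double)    # -1884620459626981632 (the low byte is gone)
print(float(exact))  # -1.8846204596269816e+18, which JavaScript renders as -1884620459626981600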


Ohhhhh... you're totally right. I was checking values from our old cluster in Marvel and values from our new cluster via the Python DSL. If I check the old cluster via Python as well, the numbers are the same. I feel very silly for not checking that before.

Thank you so much for walking me through this!
