Generate a hash/CRC for part of a document?

I need to generate a unique (or almost-certainly-unique) hash/CRC value for part of my ElasticSearch document. The sub-document being hashed can contain arbitrary data types, nested documents, and lists where order is unimportant, so generating a canonical hash which is identical for every equivalent document representation is not trivial.

Is there any plugin, utility, or best-practice documentation that does this or discusses how to do it well? The hash could be computed and updated by the ElasticSearch server when the document is updated, or I could compute it myself (from the document's nested Java HashMap representation) and send it as a regular attribute.

The reason I need to do this is that I’m denormalizing pieces of my periodically updated document into other documents. For example, the reference attributes of document ID 123 might be copied as a nested document into 50 other documents in an index with 25 million root documents.
I’ll then need to periodically search for and update any copy of document 123 whose stored hash no longer matches the current one.

Unfortunately, for complicated reasons, I can’t easily generate a simple version number to track the changes.

Hey,

There is a Painless sha1() function. You could use a script processor in an ingest pipeline to concatenate all the required field values into a single string, run sha1() on that string, and store the returned hash in a dedicated field.

--Alex
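
For the client-side option mentioned in the question (hashing the nested Java map before indexing), here is a minimal sketch rather than a definitive implementation. It is not part of Elasticsearch: the class name CanonicalHash and its methods are hypothetical, and SHA-256 is an arbitrary choice of digest. It assumes the sub-document contains only maps with string-like keys, collections, strings, booleans, finite numbers, and nulls; that every list is unordered (as stated in the question); and that numerically equal values such as 1 and 1.0 should hash the same. Map keys and list elements are sorted, and strings are length-prefixed, so equivalent representations produce the same encoding.

```java
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: order-insensitive hash of a nested map/list structure.
final class CanonicalHash {

    private CanonicalHash() {}

    /** Returns the SHA-256 hex digest of the canonical encoding of value. */
    static String sha256Hex(Object value) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(canonical(value).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every conforming Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Builds a self-delimiting string that ignores key order and list order. */
    static String canonical(Object value) {
        if (value == null) {
            return "z";
        }
        if (value instanceof String) {
            String s = (String) value;
            return "s" + s.length() + ":" + s; // length prefix avoids ambiguity
        }
        if (value instanceof Boolean) {
            return ((Boolean) value) ? "t" : "f";
        }
        if (value instanceof Number) {
            // Assumption: finite numbers only; NaN and Infinity throw here.
            BigDecimal n = new BigDecimal(value.toString()).stripTrailingZeros();
            return "n" + n.toPlainString() + ";";
        }
        if (value instanceof Map) {
            // TreeMap sorts the keys so that insertion order does not matter.
            TreeMap<String, String> sorted = new TreeMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sorted.put(String.valueOf(e.getKey()), canonical(e.getValue()));
            }
            StringBuilder sb = new StringBuilder("{");
            for (Map.Entry<String, String> e : sorted.entrySet()) {
                sb.append(canonical(e.getKey())).append(e.getValue());
            }
            return sb.append('}').toString();
        }
        if (value instanceof Collection) {
            // Sorting the encoded elements makes list order irrelevant.
            List<String> parts = new ArrayList<>();
            for (Object o : (Collection<?>) value) {
                parts.add(canonical(o));
            }
            Collections.sort(parts);
            return "[" + String.join("", parts) + "]";
        }
        throw new IllegalArgumentException("Unsupported type: " + value.getClass());
    }
}
```

Calling CanonicalHash.sha256Hex(subDocument) and sending the result as a regular field would then let a query find copies whose stored hash differs from the current one. If some lists are order-sensitive after all, the sort in the Collection branch would have to be skipped for them.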
