Ingest pipeline: _id generation

jetnet · September 19, 2018, 8:11pm

Is it safe to generate document IDs in an ingest pipeline as follows:

  {
    "script": {
      "ignore_failure": false,
      "lang": "painless",
      "source": "ctx._id = 'prfx-'.concat(ctx.message.hashCode().toString())"
    }
  }

jetnet · September 19, 2018, 8:46pm

Or is it better to use a name-based UUID:

  {
    "script": {
      "ignore_failure": false,
      "lang": "painless",
      "source": """
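        // Copy the message into a byte array; the (byte) cast keeps only the low 8 bits of each char
        char[] buffer = ctx.message.toCharArray();
        byte[] b = new byte[buffer.length];
        for (int i = 0; i < b.length; i++) {
          b[i] = (byte) buffer[i];
        }
        // Name-based UUID derived from those bytes
        ctx._id = UUID.nameUUIDFromBytes(b).toString();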
      """
    }
  }

spinscale · September 20, 2018, 11:57am

Both approaches can produce duplicate IDs and therefore overwrite data. Is this something you are OK with? Could you use the automatic ID generation within Elasticsearch?
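
To make the collision risk concrete, here is a small Python sketch (not Painless) that emulates Java's `String.hashCode()` and the char-to-byte loop from the second script. The helper names `java_hash_code` and `name_uuid_from_truncated_chars` are made up for illustration, and the second one assumes that `UUID.nameUUIDFromBytes` is an MD5 digest with the version-3 and variant bits set, as the JDK documents it.

```python
import hashlib
import uuid


def java_hash_code(s: str) -> int:
    """Emulate Java's String.hashCode() in 32-bit signed arithmetic."""
    h = 0
    data = s.encode("utf-16-be")
    # Java hashes UTF-16 code units, so walk the string two bytes at a time.
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def name_uuid_from_truncated_chars(s: str) -> str:
    """Emulate the (byte) cast loop followed by UUID.nameUUIDFromBytes(b)."""
    data = s.encode("utf-16-be")
    # Keep only the low byte of each UTF-16 code unit, as the (byte) cast does.
    truncated = bytes(data[i + 1] for i in range(0, len(data), 2))
    digest = hashlib.md5(truncated).digest()
    return str(uuid.UUID(bytes=digest, version=3))


# Two different messages that share one 32-bit hash value.
print(java_hash_code("Aa"), java_hash_code("BB"))
# "\u0141" has the low byte 0x41, the same as "A", so both inputs hash the same bytes.
print(name_uuid_from_truncated_chars("\u0141") == name_uuid_from_truncated_chars("A"))
```

A 32-bit hash leaves only about four billion possible IDs, so collisions become likely well before an index reaches that size. The UUID variant has a much larger output space, but the byte cast discards information for any character above U+00FF before the digest is computed.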

jetnet · September 20, 2018, 1:45pm

I need to re-index some pieces of data quite often, which is why I'd like to compute a hash over the whole message in order to avoid duplicates.
Of course, I also don't want to overwrite data by creating non-unique IDs.
Do you know if there are other hash methods available in the Painless API?
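
The goal described here, a content-derived ID that makes re-indexing idempotent, can be sketched outside Elasticsearch. The Python below is illustrative only: `content_id`, `index_all`, and the dict standing in for an index are hypothetical, and SHA-256 is a substitution chosen for the sketch, not something this thread establishes as available in Painless.

```python
import hashlib


def content_id(message: str, prefix: str = "prfx-") -> str:
    """Derive a deterministic ID from the full UTF-8 content of the message."""
    return prefix + hashlib.sha256(message.encode("utf-8")).hexdigest()


def index_all(index: dict, messages: list) -> dict:
    """Write each message under its content ID; identical messages share a key."""
    for message in messages:
        index[content_id(message)] = message
    return index


index = {}
index_all(index, ["disk full on node-1", "disk full on node-2"])
# Re-indexing the same data overwrites the existing entries instead of adding new ones.
index_all(index, ["disk full on node-1", "disk full on node-2"])
print(len(index))
```

With a wide digest, an overwrite almost always means the same message arrived again, which is the deduplication wanted here. With a narrow hash, an overwrite may instead mean two different messages collided.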

jetnet · September 21, 2018, 9:28am

I ended up with a "paranoid" ID generator - _id = MD5(text) + "." + MD5(base64(text)):

Stored script generate_id:

POST _scripts/generate_id
{
  "script": {
	"lang": "painless",
	"source": """
		// function getUUID
		String getUUID(def str) {
		  def res = null;
		  if (str != null) {
			 char[] buffer = str.toCharArray();
			 byte[] b = new byte[buffer.length];
			 for (int i = 0; i < b.length; i++) {
			  b[i] = (byte) buffer[i];
			 }
			 res = UUID.nameUUIDFromBytes(b).toString();
		  } else {
			  // randomUUID does not work
			  //res = UUID.randomUUID().toString();
			  res = Math.random().toString();
		  }
		  return res;
		}
		
		//
		// Main
		// params.field - the field name.
		// Note: "dotted" names (like "content.message") will not work.
		//
		if (ctx.containsKey(params.field) && !ctx[params.field].isEmpty()) {
		  ctx._id = getUUID(ctx[params.field]) + "." + getUUID(ctx[params.field].encodeBase64());
		} else {
		  ctx._id = params.field + "_empty_" + getUUID(null);
		}
	"""
  }
}

How to call it in a pipeline:

"processors": [
...
{
    "script": {
        "ignore_failure": false,
        "id": "generate_id",
        "params": {
            "field": "message"
        }
    }
},
...
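
For readers who want to check what IDs this scheme yields without a cluster, here is a rough Python equivalent of the non-empty branch of the stored script. It is a sketch under two assumptions that the thread does not confirm: that Painless `encodeBase64()` encodes the UTF-8 bytes of the string, and that `UUID.nameUUIDFromBytes` is an MD5 digest with the version-3 bits set. `name_uuid` is a hypothetical helper; `generate_id` only borrows the stored script's name.

```python
import base64
import hashlib
import uuid


def name_uuid(text: str) -> str:
    """MD5-based version-3 UUID over the low byte of each UTF-16 code unit."""
    data = text.encode("utf-16-be")
    low_bytes = bytes(data[i + 1] for i in range(0, len(data), 2))
    return str(uuid.UUID(bytes=hashlib.md5(low_bytes).digest(), version=3))


def generate_id(doc: dict, field: str) -> str:
    """Mirror the non-empty branch of the stored script for a top-level field."""
    text = doc.get(field)
    if not text:
        # The stored script falls back to a random suffix here; the sketch just refuses.
        raise ValueError("field missing or empty")
    # Assumption: Base64 is computed over the UTF-8 bytes of the string.
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return name_uuid(text) + "." + name_uuid(encoded)


doc = {"message": "connection reset by peer"}
print(generate_id(doc, "message"))
# The same content always maps to the same ID.
print(generate_id(doc, "message") == generate_id(dict(doc), "message"))
```

Under those assumptions the second half does more than repeat the first: the byte cast drops the high byte of characters above U+00FF, so "\u0141" and "A" share the first UUID, while their Base64 forms are plain ASCII and still differ.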

system · October 19, 2018, 9:28am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
