Ingest pipeline: _id generation

Is it safe to generate document IDs in a pipeline like this:

  {
    "script": {
      "ignore_failure": false,
      "lang": "painless",
      "source": "ctx._id = 'prfx-'.concat(ctx.message.hashCode().toString())"
    }
  }

Or is it better to use a UUID:

  {
    "script": {
      "ignore_failure": false,
      "lang": "painless",
      "source": """
        // Narrow each char of the message to one byte, then derive a name-based UUID.
        char[] buffer = ctx.message.toCharArray();
        byte[] b = new byte[buffer.length];
        for (int i = 0; i < b.length; i++) {
          b[i] = (byte) buffer[i];
        }
        ctx._id = UUID.nameUUIDFromBytes(b).toString();
      """
    }
  }

Both approaches can lead to duplicate IDs and overwritten data. Is this something you are OK with? Could you use the automatic ID generation built into Elasticsearch?
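To make the collision risk concrete, here is a minimal sketch in plain Java (not Painless) that mirrors the two scripts above and can be run outside Elasticsearch. The class and method names are made up for illustration. It shows two known cases: "Aa" and "BB" share the same String.hashCode(), and the (byte) cast keeps only the low 8 bits of each char, so "A" and "\u0141" end up as the same byte array.

```java
import java.util.UUID;

// Hypothetical demo class; it only mirrors the logic of the two pipeline scripts.
final class IdCollisionDemo {

    // First script: prefix plus the 32-bit String.hashCode().
    static String hashCodeId(String message) {
        return "prfx-" + Integer.toString(message.hashCode());
    }

    // Second script: narrow each char to one byte, then build an MD5-based name UUID.
    static String truncatedUuidId(String message) {
        char[] buffer = message.toCharArray();
        byte[] b = new byte[buffer.length];
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) buffer[i]; // keeps only the low 8 bits of each char
        }
        return UUID.nameUUIDFromBytes(b).toString();
    }

    public static void main(String[] args) {
        // Different messages, identical IDs: both comparisons evaluate to true.
        System.out.println(hashCodeId("Aa").equals(hashCodeId("BB")));
        System.out.println(truncatedUuidId("A").equals(truncatedUuidId("\u0141")));
    }
}
```

The hashCode() variant has only 32 bits of output, so accidental collisions become likely well before the index is large. The UUID variant is much wider, but the byte narrowing can still map distinct non-ASCII messages to the same ID.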

I need to re-index some pieces of data quite often, which is why I'd like to compute a hash over the whole message to avoid duplicates.
Of course, I also don't want to overwrite data by generating non-unique IDs.
Do you know whether other hash methods are available in the Painless API?
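For comparison, a wider hash such as SHA-256 could be computed by the indexing client before the document is sent. The sketch below is plain Java, not Painless, and assumes the ID can be set client-side. This thread does not confirm which digest functions Painless itself exposes, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical client-side helper; not part of any Elasticsearch API.
final class Sha256Id {

    // Returns the SHA-256 digest of the UTF-8 encoded message as 64 lowercase hex chars.
    static String sha256Hex(String message) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(message.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte x : digest) {
                hex.append(String.format("%02x", x));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("abc"));
    }
}
```

The same message always yields the same 64-character ID, so re-indexing overwrites the earlier copy instead of creating a duplicate.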

I ended up with a "paranoid" ID generator, where MD5 stands for the MD5-based name UUID returned by UUID.nameUUIDFromBytes: _id = MD5(text) + "." + MD5(base64(text)):

Stored script generate_id:

POST _scripts/generate_id
{
  "script": {
	"lang": "painless",
	"source": """
		// function getUUID
		String getUUID(def str) {
		  def res = null;
		  if (str != null) {
			 char[] buffer = str.toCharArray();
			 byte[] b = new byte[buffer.length];
			 for (int i = 0; i < b.length; i++) {
			  b[i] = (byte) buffer[i];
			 }
			 res = UUID.nameUUIDFromBytes(b).toString();
		  } else {
			  // randomUUID does not work
			  //res = UUID.randomUUID().toString();
			  res = Math.random().toString();
		  }
		  return res;
		}
		
		//
		// Main
		// params.field - the field name.
		// Note: "dotted" names (like "content.message") will not work,
		// because ctx[params.field] looks up a single top-level key.
		//
		if (ctx.containsKey(params.field) && !ctx[params.field].isEmpty()) {
		  ctx._id = getUUID(ctx[params.field]) + "." + getUUID(ctx[params.field].encodeBase64());
		} else {
		  ctx._id = params.field + "_empty_" + getUUID(null);
		}
	"""
  }
}
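
To check the ID scheme offline, the non-empty branch of the stored script can be mirrored in plain Java. This is a rough sketch only: it assumes that Painless encodeBase64() encodes the UTF-8 bytes of the string, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.UUID;

// Hypothetical offline replica of the non-empty branch of generate_id.
final class ParanoidId {

    // Mirrors getUUID(str): narrow each char to a byte, then build a name UUID.
    static String nameUuid(String str) {
        char[] buffer = str.toCharArray();
        byte[] b = new byte[buffer.length];
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) buffer[i];
        }
        return UUID.nameUUIDFromBytes(b).toString();
    }

    // Mirrors: getUUID(text) + "." + getUUID(text.encodeBase64())
    // Assumption: the Base64 input is the UTF-8 encoding of the text.
    static String generateId(String text) {
        String b64 = Base64.getEncoder()
                .encodeToString(text.getBytes(StandardCharsets.UTF_8));
        return nameUuid(text) + "." + nameUuid(b64);
    }

    public static void main(String[] args) {
        System.out.println(generateId("hello"));
        // "A" and "\u0141" collide in the first half but differ in the second.
        System.out.println(generateId("A"));
        System.out.println(generateId("\u0141"));
    }
}
```

Each ID is two 36-character UUID strings joined by a dot, 73 characters in total. The Base64 half is computed from the full UTF-8 bytes, so it still separates messages that the byte narrowing in the first half would merge.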

How to call it in a pipeline:

"processors": [
...
{
    "script": {
        "ignore_failure": false,
        "id": "generate_id",
        "params": {
            "field": "message"
        }
    }
},
...
