Is it safe to generate doc IDs in a pipeline as follows:
{
"script": {
"ignore_failure": false,
"lang": "painless",
"source": "ctx._id = 'prfx-'.concat(ctx.message.hashCode().toString())"
}
}
Or is it better to use a UUID:
{
"script": {
"ignore_failure": false,
"lang": "painless",
"source": """
char[] buffer = ctx.message.toCharArray();
byte[] b = new byte[buffer.length];
for (int i = 0; i < b.length; i++) {
b[i] = (byte) buffer[i];
}
ctx._id= UUID.nameUUIDFromBytes(b).toString();
"""
}
}
Both approaches can produce duplicate IDs and therefore overwrite data. Is that something you are OK with? Could you use the automatic ID generation within Elasticsearch instead?
I need to re-index some pieces of data quite often, which is why I'd like to compute a hash based on the whole message in order to avoid duplicates.
Of course, I also don't want to overwrite data by creating non-unique IDs.
Do you know whether other hash methods are available in the Painless API?
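To make the collision risk concrete, here is a minimal sketch in plain Java (Painless exposes the same `String.hashCode()` and `UUID.nameUUIDFromBytes()` calls). The class and method names are made up for illustration, and the UUID helper uses UTF-8 bytes rather than the char-to-byte cast from the script above, which gives the same bytes for ASCII input.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical demo class; not part of Elasticsearch or Painless.
class IdCollisionDemo {
    // Mirrors the first script: prefix plus the 32-bit String.hashCode().
    static String hashCodeId(String message) {
        return "prfx-" + message.hashCode();
    }

    // Mirrors the second script: a name-based (MD5) UUID over the message bytes.
    // Assumption: UTF-8 bytes instead of the (byte) cast used in the script.
    static String uuidId(String message) {
        return UUID.nameUUIDFromBytes(message.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a well-known String.hashCode() collision (both 2112).
        System.out.println(hashCodeId("Aa").equals(hashCodeId("BB"))); // true
        // The 128-bit name-based UUID keeps these two inputs apart.
        System.out.println(uuidId("Aa").equals(uuidId("BB")));         // false
    }
}
```

With only 32 bits, `hashCode()` collisions become likely after tens of thousands of distinct messages, while the MD5-based UUID makes accidental collisions far less probable, though still not impossible.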
I ended up with a "paranoid" ID generator: _id = MD5(text) + "." + MD5(base64(text))
Stored script generate_id:
POST _scripts/generate_id
{
"script": {
"lang": "painless",
"source": """
// function getUUID
String getUUID(def str) {
def res = null;
if (str != null) {
char[] buffer = str.toCharArray();
byte[] b = new byte[buffer.length];
for (int i = 0; i < b.length; i++) {
b[i] = (byte) buffer[i];
}
res = UUID.nameUUIDFromBytes(b).toString();
} else {
// randomUUID does not work
//res = UUID.randomUUID().toString();
res = Math.random().toString();
}
return res;
}
//
// Main
// params.field - the name of a top-level field.
// Note: "dotted" names (like "content.message") will not work.
//
if (ctx.containsKey(params.field) && !ctx[params.field].isEmpty()) {
ctx._id = getUUID(ctx[params.field]) + "." + getUUID(ctx[params.field].encodeBase64());
} else {
ctx._id = params.field + "_empty_" + getUUID(null);
}
"""
}
}
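For anyone who wants to check the resulting IDs outside Elasticsearch, the same scheme can be sketched in plain Java. This is only an approximation under stated assumptions: the class and method names are hypothetical, and the Base64 step is assumed to encode the UTF-8 bytes of the text. It also shows one caveat of the `(byte)` cast used in the script: it keeps only the low 8 bits of each char, so non-Latin-1 characters can map to the same byte as an ASCII character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.UUID;

// Hypothetical helper class; not part of Elasticsearch or Painless.
class ParanoidId {
    // Name-based (type 3, MD5) UUID over the UTF-8 bytes of the input.
    static String nameUuid(String text) {
        return UUID.nameUUIDFromBytes(text.getBytes(StandardCharsets.UTF_8)).toString();
    }

    // Same call, but with the char-to-byte cast used in the stored script.
    static String nameUuidCast(String text) {
        char[] buffer = text.toCharArray();
        byte[] b = new byte[buffer.length];
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) buffer[i]; // keeps only the low 8 bits of each char
        }
        return UUID.nameUUIDFromBytes(b).toString();
    }

    // ID = UUID(text) + "." + UUID(base64(text)), following generate_id.
    static String generateId(String text) {
        String b64 = Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
        return nameUuid(text) + "." + nameUuid(b64);
    }

    public static void main(String[] args) {
        String id = generateId("hello world");
        System.out.println(id.length());                          // 73: two 36-char UUIDs and a dot
        System.out.println(id.equals(generateId("hello world"))); // true: same text, same ID
        // U+0141 truncates to 0x41 ('A'), so the cast variant cannot tell them apart.
        System.out.println(nameUuidCast("\u0141").equals(nameUuidCast("A"))); // true
        System.out.println(nameUuid("\u0141").equals(nameUuid("A")));         // false
    }
}
```

Because the ID depends only on the text, re-indexing the same message yields the same `_id`, which is what makes the re-index idempotent.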
How to call it in a pipeline:
"processors": [
...
{
"script": {
"ignore_failure": false,
"id": "generate_id",
"params": {
"field": "message"
}
}
},
...