How to create my own document_id in logstash?

elastic_paul · May 27, 2015, 6:42pm

I would like to create my own document_id to avoid duplication.

I would like to make the document_id as an MD5 hash of two fields; "ip" and "sha1_fingerprint".

eg; in pseudo code:

md5_hex( "ip" + "sha1_fingerprint" )

Thanks

suyograo · May 28, 2015, 1:12am

You can first inject a field called computed_id using a ruby filter: https://www.elastic.co/guide/en/logstash/current/plugins-filters-ruby.html

So, in the ruby filter you can do: event['computed_id'] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint']). After assigning this to computed_id you can use it in the ES output: document_id => "%{computed_id}"

elastic_paul · May 28, 2015, 5:49am

Thanks very much for this. Just what I wanted. Just to help anyone finding this question, here is the code I actually used:

ruby {
  code => "require 'digest/md5';
  event['computed_id'] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])"
}

Then

document_id => "%{computed_id}"

Can I ask two more related questions:

How can I remove the field after I have set the document_id? I don't want it in my stored data. eg; remove event['computed_id']
It seems that my index is BIGGER doing it this way. Any ideas? I thought that deduplication would save space. It can't be because of the extra 'computed_id' field, can it?

Thanks

magnusbaeck · May 28, 2015, 7:02am

How can I remove the field after I have set the document_id? I don't want it in my stored data. eg; remove event['computed_id']

If you're using Logstash 1.5 you can store the computed id field as a subfield of the @metadata field. None of those fields propagate to outputs.

It seems that my index is BIGGER doing it this way. Any ideas? I thought that deduplication would save space. It can't be because of the extra 'computed_id' field, can it?

Well, to what extent are you actually deduplicating events? The computed id field (obviously) only has unique values, so it'll add a lot of terms to the index.

elastic_paul · May 28, 2015, 7:52am

If you're using Logstash 1.5 you can store the computed id field as a subfield of the @metadata field. None of those fields propagate to outputs.

That is exactly the function I am after thank you. Can you just help me with the syntax though?

I currently use:

event['computed_id'] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])

So will that become:

@metadata[md5] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])

That doesn't seem to work - do I need to do a mutate to add @metadata[md5] first? I don't find the syntax very easy.
Thanks again

warkolm · May 28, 2015, 9:15am

The document ID in ES is going to be pretty damn unique anyway, you might be reinventing the wheel here!

elastic_paul · May 28, 2015, 9:47am

Hi, I am trying to deduplicate the records.

I have lots of IP + SHA1_fingerprints that are all the same apart from timestamp, and I don't want or need them. So my thought was to use a hash of (IP + SHA1_fingerprint) as the doc_id to only save one of the records. Does that make sense?
eg;
time1 IP1 SHA1_1
time2 IP1 SHA1_1
time3 IP1 SHA1_1

Don't need or want 3 docs so only save one ?
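
As a rough illustration of the idea above, here is a minimal Ruby sketch run outside Logstash. The IP and fingerprint values are made up, and the `docs` hash is only a stand-in for an index keyed by document id. It assumes the default behaviour where a later write with the same id replaces the earlier one; with the create action the later writes would be rejected instead.

```ruby
require 'digest/md5'

# Three sample records that differ only in their timestamp (values are invented).
records = [
  { 'time' => 'time1', 'ip' => '10.0.0.1', 'sha1_fingerprint' => 'aabbcc' },
  { 'time' => 'time2', 'ip' => '10.0.0.1', 'sha1_fingerprint' => 'aabbcc' },
  { 'time' => 'time3', 'ip' => '10.0.0.1', 'sha1_fingerprint' => 'aabbcc' }
]

# Stand-in for an index keyed by document id.
docs = {}
records.each do |record|
  # The timestamp is deliberately left out of the hash input.
  id = Digest::MD5.hexdigest(record['ip'] + record['sha1_fingerprint'])
  docs[id] = record # same id, so a later record replaces the earlier one
end

puts docs.size
```

All three records map to the same id, so only one entry remains in `docs`.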

Meanwhile I am completely stuck on trying to make use of the @metadata field with the ruby code. Could someone be kind enough to help a beginner out and change my posted code to use @metadata? I just cannot get the syntax.

Thanks

magnusbaeck · May 29, 2015, 9:43am

@metadata[md5] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])

@metadata is a normal message field (except that it doesn't propagate to outputs) so this is what you're looking for:

event['@metadata']['md5'] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])

elastic_paul · May 29, 2015, 10:56am

Perfect thanks.

I also had to edit the elasticsearch plugin to accept an HTTP code 409 when using the create command. See my other thread for that one.

tomr · June 26, 2015, 10:02am

Is there a good reason not to use the computed hash as the document _id ?

I'm looking at doing the same thing, and IIUC using the hash as _id would allow me to use op_type to reduce the cost of a log replay.

warkolm · June 26, 2015, 10:05am

Depends, why would you want to replay?

tomr · June 26, 2015, 10:13am

In short, unreliable log-delivery.

warkolm · June 26, 2015, 10:19am

Might be easier to do this in your other thread

tomr · June 26, 2015, 11:08am

I hope you appreciate the irony

For reference, this now works for me.
in the filter section:

ruby {
  code => "require 'digest/md5';
  event['@metadata']['computed_id'] = Digest::MD5.hexdigest(event['message'])"
}

and in my ES output section:

document_id => "%{[@metadata][computed_id]}"

thanks all
t

fninja · December 1, 2015, 10:07am

This so needs to become a plugin.
@warkolm
how can one use the logstash-generated document id to avoid duplication, if it gets generated anew every time you index a document?

So if you index "document a" 3 times it will receive a different document id upon each time.

Now if you create a hash out of "document a" three times, this hash will be the same three times.
Great feature for replaying logs gracefully.

Christian_Dahlqvist · December 1, 2015, 10:28am

There are already two plugins that you can use for this and avoid having to use the Ruby filter:

The checksum filter plugin is labelled as experimental, but allows you to specify which fields you want to include in the hash calculation, so you could choose to exclude the timestamp field as in your example.

The fingerprint filter plugin seems less experimental, but only supports specifying a single field as input. You would therefore need to concatenate the fields you want to hash into a single field, possibly under '@metadata' before running this.
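
If you do concatenate several fields before hashing, one thing to watch is that plain concatenation can make different field combinations look identical. The sketch below is plain Ruby, not Logstash configuration; `joined_id` and the `|` separator are hypothetical names chosen for illustration and are not part of either plugin.

```ruby
require 'digest/md5'

# Hypothetical helper (not part of Logstash): hash several field values as one string.
# `sep` is an assumed separator that should not occur inside the values themselves.
def joined_id(values, sep = '|')
  Digest::MD5.hexdigest(values.join(sep))
end

# Plain concatenation hashes "abc" in both cases, so the two pairs collide ...
plain_a = Digest::MD5.hexdigest('ab' + 'c')
plain_b = Digest::MD5.hexdigest('a' + 'bc')

# ... while the separated form hashes "ab|c" and "a|bc", which are different strings.
sep_a = joined_id(['ab', 'c'])
sep_b = joined_id(['a', 'bc'])

puts plain_a == plain_b
puts sep_a == sep_b
```

For an IP plus a fixed-length SHA1 fingerprint this is unlikely to matter in practice, but a separator is a cheap safeguard when the fields have variable length.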

sjivan · November 17, 2016, 10:30pm

Hi,
What's the LS 5.0 equivalent of the following, since direct access to event fields is no longer allowed?

event['@metadata']['myfield'] = 'foo'

I tried

event.get('@metadata').set('myfield', 'foo')

but I get the following exception

Ruby exception occurred: undefined method `set' for #<Hash:0x4a52075d>

Thanks,
Sanjiv

magnusbaeck · November 18, 2016, 6:30am

See https://www.elastic.co/guide/en/logstash/current/event-api.html#_event_api. Please start a new thread if you have any follow-up questions.
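
For readers landing here: in the Event API, nested fields are addressed with a field reference such as [@metadata][myfield], passed to event.set and event.get. The attempt above fails because event.get('@metadata') returns a plain Hash, which has no set method. The sketch below shows the two calls; `StandInEvent` is a hypothetical stand-in written only so the snippet runs outside Logstash, and it is not Logstash's real event class. Inside a ruby filter, `event` is provided for you.

```ruby
require 'digest/md5'

# Hypothetical stand-in for the Logstash event, ONLY so this sketch runs on its own.
# It mimics get/set with "[a][b]" style field references and nothing else.
class StandInEvent
  def initialize(data = {})
    @data = data
  end

  def get(ref)
    path(ref).reduce(@data) { |node, key| node.is_a?(Hash) ? node[key] : nil }
  end

  def set(ref, value)
    *parents, last = path(ref)
    node = parents.reduce(@data) { |n, key| n[key] ||= {} }
    node[last] = value
  end

  private

  # Turn "[a][b]" into ["a", "b"]; a bare name such as "message" becomes ["message"].
  def path(ref)
    keys = ref.scan(/\[([^\]]+)\]/).flatten
    keys.empty? ? [ref] : keys
  end
end

event = StandInEvent.new('message' => 'hello world')

# These two lines are what would go inside the ruby filter's code string on 5.0+:
event.set('[@metadata][myfield]', 'foo')
event.set('[@metadata][computed_id]', Digest::MD5.hexdigest(event.get('message')))

puts event.get('[@metadata][myfield]')
```

The output section can keep referring to the value as "%{[@metadata][computed_id]}", as in the earlier posts.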

EthanStark · June 7, 2017, 11:19am

output {
  if [document_id] {
    elasticsearch_http {
      host => "127.0.0.1"
      document_id => "%{document_id}"
    }
  } else {
    elasticsearch_http {
      host => "127.0.0.1"
    }
  }
}
