How to create my own document_id in logstash?


(Paul) #1

I would like to create my own document_id to avoid duplication.

I would like to make the document_id an MD5 hash of two fields: "ip" and "sha1_fingerprint".

eg; in pseudo code:

md5_hex( ip + sha1_fingerprint )

Thanks


(Suyog Rao) #2

You can first inject a field called computed_id using a ruby filter: https://www.elastic.co/guide/en/logstash/current/plugins-filters-ruby.html

So, in the ruby filter you can do: event['computed_id'] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint']). After assigning this to computed_id you can use it in the ES output: document_id => "%{computed_id}"


(Paul) #3

Thanks very much for this. Just what I wanted. Just to help anyone finding this question, here is the code I actually used:

ruby {
  code => "require 'digest/md5';
  event['computed_id'] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])"
}

Then

document_id => "%{computed_id}"
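
One thing to watch with plain concatenation: two different field pairs can join into the same string and so get the same id. With fixed-length SHA1 fingerprints this is unlikely to bite, but a separator removes the doubt. A minimal sketch in plain Ruby, outside Logstash; the helper name computed_id and the '|' separator are only illustrative assumptions:

```ruby
require 'digest/md5'

# Hypothetical helper (not part of Logstash): hash two field values
# with a separator between them so the field boundary is preserved.
def computed_id(ip, fingerprint, sep = '|')
  Digest::MD5.hexdigest([ip, fingerprint].join(sep))
end

# Without a separator, both pairs concatenate to "10.0.0.11abc".
plain_a = Digest::MD5.hexdigest('10.0.0.1' + '1abc')
plain_b = Digest::MD5.hexdigest('10.0.0.11' + 'abc')
puts plain_a == plain_b

# With a separator the two pairs hash differently.
sep_a = computed_id('10.0.0.1', '1abc')
sep_b = computed_id('10.0.0.11', 'abc')
puts sep_a == sep_b
```

Inside the ruby filter the same idea would be Digest::MD5.hexdigest([event['ip'], event['sha1_fingerprint']].join('|')). Using join also avoids a TypeError when one of the fields is missing, since nil becomes an empty string.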

Can I ask two more related questions:

  1. How can I remove the field after I have set the document_id? I don't want it in my stored data. eg; remove event['computed_id']

  2. It seems that my index is BIGGER doing it this way? Any ideas? I thought that deduplication would save space? It can't be because of the extra 'computed_id' field can it?

Thanks


Using my own document_id - is there a faster way?
(Magnus Bäck) #4
  1. How can I remove the field after I have set the document_id? I don't want it in my stored data. eg; remove event['computed_id']

If you're using Logstash 1.5 you can store the computed id field as a subfield of the @metadata field. None of those fields propagate to outputs.

  2. It seems that my index is BIGGER doing it this way? Any ideas? I thought that deduplication would save space? It can't be because of the extra 'computed_id' field can it?

Well, to what extent are you actually deduplicating events? The computed id field (obviously) only has unique values, so it'll add a lot of terms to the index.


(Paul) #5

If you're using Logstash 1.5 you can store the computed id field as a subfield of the @metadata field. None of those fields propagate to outputs.

That is exactly the function I am after thank you. Can you just help me with the syntax though?

I currently use:

event['computed_id'] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])

So will that become:

@metadata[md5] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])

That doesn't seem to work - do I need to do a mutate to add @metadata[md5] first? I don't find the syntax very easy.
Thanks again


(Mark Walkom) #6

The document ID in ES is going to be pretty damn unique anyway, you might be reinventing the wheel here!


(Paul) #7

Hi, I am trying to deduplicate the records.

I have lots of IP + SHA1_fingerprints that are all the same apart from timestamp, and I don't want or need them. So my thought was to use a hash of (IP + SHA1_fingerprint) as the doc_id to only save one of the records. Does that make sense?
eg;
time1 IP1 SHA1_1
time2 IP1 SHA1_1
time3 IP1 SHA1_1

  • Don't need or want 3 docs so only save one ?
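
To illustrate the idea, here is a rough sketch in plain Ruby where a Hash stands in for the index (an assumption for illustration only, it is not how Elasticsearch stores documents). Events that differ only in timestamp get the same id, so a later one overwrites the earlier one instead of adding a new document:

```ruby
require 'digest/md5'

# Sample events: three differ only by time, one has a different IP.
events = [
  { 'time' => 'time1', 'ip' => 'IP1', 'sha1_fingerprint' => 'SHA1_1' },
  { 'time' => 'time2', 'ip' => 'IP1', 'sha1_fingerprint' => 'SHA1_1' },
  { 'time' => 'time3', 'ip' => 'IP1', 'sha1_fingerprint' => 'SHA1_1' },
  { 'time' => 'time4', 'ip' => 'IP2', 'sha1_fingerprint' => 'SHA1_1' }
]

# Hash used as a stand-in for an index keyed by document id.
index = {}
events.each do |e|
  id = Digest::MD5.hexdigest(e['ip'] + e['sha1_fingerprint'])
  index[id] = e # same id: the newer event replaces the stored one
end

puts index.size
index.each { |id, e| puts "#{id} #{e['time']} #{e['ip']}" }
```

Four events go in and two documents remain, one per distinct IP + SHA1 pair. With overwrite semantics the copy that survives is the last one seen (time3 here), not the first.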

Meanwhile I am completely stuck trying to make use of the @metadata field in the ruby code. Could someone be kind enough to help a beginner out and change my posted code to use @metadata? I just cannot get the syntax.

Thanks


(Magnus Bäck) #8

@metadata[md5] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])

@metadata is a normal message field (except that it doesn't propagate to outputs) so this is what you're looking for:

event['@metadata']['md5'] = Digest::MD5.hexdigest(event['ip'] + event['sha1_fingerprint'])

(Paul) #9

Perfect thanks.

I also had to edit the elasticsearch plugin to accept an HTTP code 409 when using the create command. See my other thread for that one.


_id as a consistent hash to avoid duplication with replays
#10

Is there a good reason not to use the computed hash as the document _id ?

I'm looking at doing the same thing, and IIUC using the hash as _id would allow me to use op_type to reduce the cost of a log replay.
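
Roughly what I have in mind, as a plain Ruby simulation rather than real client code. FakeIndex is a made-up stand-in; the only Elasticsearch behaviour assumed here is that a create on an existing _id is rejected with HTTP 409 instead of overwriting the document:

```ruby
require 'digest/md5'

# Hypothetical in-memory stand-in for an index, for illustration only.
class FakeIndex
  def initialize
    @docs = {}
  end

  # Mimics create semantics: store only if the id is new.
  def create(id, doc)
    return 409 if @docs.key?(id) # conflict, existing document untouched
    @docs[id] = doc
    201
  end

  def size
    @docs.size
  end
end

index = FakeIndex.new
lines = ['line one', 'line two', 'line three']

# First delivery, then a full replay of the same lines.
first  = lines.map { |l| index.create(Digest::MD5.hexdigest(l), l) }
replay = lines.map { |l| index.create(Digest::MD5.hexdigest(l), l) }

puts first.inspect
puts replay.inspect
puts index.size
```

The replayed lines are all rejected, so nothing is rewritten and the document count stays the same.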


(Mark Walkom) #11

Depends, why would you want to replay?


#12

In short, unreliable log-delivery.


(Mark Walkom) #13

Might be easier to do this in your other thread :slight_smile:


#14

I hope you appreciate the irony :smile:

For reference, this now works for me.
in the filter section:

ruby {
  code => "require 'digest/md5';
  event['@metadata']['computed_id'] = Digest::MD5.hexdigest(event['message'])"
}

and in my ES output section:

document_id => "%{[@metadata][computed_id]}"

thanks all
t


#15

This so needs to become a plugin.
@warcolm
how can one use the logstash-generated document id to avoid duplication, if it gets generated anew every time you index a document?

So if you index "document a" 3 times it will receive a different document id each time.

Now if you create a hash out of "document a" three times, this hash will be the same three times.
Great feature for replaying logs gracefully.


(Christian Dahlqvist) #16

There are already two plugins that you can use for this and avoid having to use the Ruby filter:

The checksum filter plugin is labelled as experimental, but allows you to specify which fields you want to include in the hash calculation, so you could choose to exclude the timestamp field as in your example.

The fingerprint filter plugin seems less experimental, but only supports specifying a single field as input. You would therefore need to concatenate the fields you want to hash into a single field, possibly under '@metadata' before running this.


(Sanjiv Jivan) #17

Hi,
What's the LS 5.0 equivalent of the following, since direct access to event fields is no longer allowed?

event['@metadata']['myfield'] = 'foo'

I tried

event.get('@metadata').set('myfield', 'foo')

but I get the following exception

Ruby exception occurred: undefined method `set' for #<Hash:0x4a52075d>

Thanks,
Sanjiv


(Magnus Bäck) #18

See https://www.elastic.co/guide/en/logstash/current/event-api.html#_event_api. Please start a new thread if you have any follow-up questions.
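
For anyone landing here: the event API uses event.get and event.set with a field reference string, so the nested field is addressed as '[@metadata][myfield]' in a single call. The attempt above fails because event.get('@metadata') returns a plain Hash, which has no set method. A small sketch that runs outside Logstash; FakeEvent is a hypothetical stand-in written only for this example and is not the real LogStash::Event class:

```ruby
# Hypothetical stand-in for the Logstash event object, covering only
# get/set with "[a][b]" style field references.
class FakeEvent
  def initialize(data = {})
    @data = data
  end

  # Walk the reference path and return the value, or nil if absent.
  def get(ref)
    path(ref).reduce(@data) { |node, key| node.is_a?(Hash) ? node[key] : nil }
  end

  # Create intermediate hashes as needed, then assign the leaf.
  def set(ref, value)
    *parents, leaf = path(ref)
    node = parents.reduce(@data) { |n, key| n[key] ||= {} }
    node[leaf] = value
  end

  private

  # "[@metadata][myfield]" becomes ["@metadata", "myfield"]; "message" stays as is.
  def path(ref)
    ref.start_with?('[') ? ref.scan(/\[([^\]]+)\]/).flatten : [ref]
  end
end

event = FakeEvent.new('message' => 'hello')
event.set('[@metadata][myfield]', 'foo')
puts event.get('[@metadata][myfield]')
puts event.get('message')
```

In a real ruby filter on 5.x only the event.set('[@metadata][myfield]', 'foo') line is needed, since Logstash supplies the event object.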


(Ethan Stark) #19

output {
  if [document_id] {
    elasticsearch_http {
      host => "127.0.0.1"
      document_id => "%{document_id}"
    }
  } else {
    elasticsearch_http {
      host => "127.0.0.1"
    }
  }
}