Using my own document_id - is there a faster way?

Hi Everyone,
For years I have been using my own document_id to de-duplicate my data. The principle is that I use a Ruby filter to create a unique hash of the IP address and the SHA fingerprint, and use this as the document_id. When I try to add to the index it will only add NEW ip+sha data, and so de-duplicates.
This was discussed in this old thread:

Works just fine but it is very slow because it checks every item before indexing. Does anyone have any better ideas on how to do this faster? A better design? A better workflow?

Actual code (updated for Logstash 6.0):

    ruby {
      code => "require 'digest/md5';
        event.set('[@metadata][computed_id]', Digest::MD5.hexdigest(event.get('[ip]') + event.get('[fingerprint_sha1]')))"
    }

What kind of data are you indexing?

Parsed SSL certificates in JSON.

I don't think I have any ideas on how to make it faster. If you had a timestamp associated with the event you could create a fast and efficient identifier, but from the linked thread it seems you don't. Smaller shards and faster disks may help, but that is all I can think of right now.

I do have a timestamp, but I can't see how to use that to deduplicate.
What I want to achieve is to say "I have this certificate from this IP already, so don't add it again".
It's the (IP+SHA) combination that is the unique identifier used to avoid duplication.

Here is a thought: Could I overwrite an existing entry within an index? If I get another identical (IP+SHA) could I simply add it with an overwrite so that there is only ever one copy? In other words ignore the timestamp and assume it was the same document?

This would save space and would stop me from having to check every single document before adding?

Does that make any sense?

I incorrectly assumed that was what you were doing as it is the most common way to handle it.

I'm not sure that is what I am doing. I think I have it set up to only allow one document, so it has to check every single time to make sure it's unique, rather than just skipping the check and overwriting.
Here is my warning if I turn on debug:

{"create"=>
  {"_index"=>"ssl-2017.11",
  "_type"=>"ssl",
  "_id"=>"%{[@metadata][computed_id]}",
  "status"=>409,
  "error"=>{
    "type"=>"document_already_exists_exception",
    "reason"=>"[ssl][%{[@metadata][computed_id]}]: document already exists",
    "shard"=>"2", "index"=>"ssl-2017.11"}
  } 
}

Rather than generate that warning, which implies it is checking for a unique doc_id, could I skip the check and just overwrite? As I am not doing that now, what would I configure to allow that?

If you set the action to create, you will get a failure. If you instead set it to index, it will overwrite. In both cases Elasticsearch has to check whether the document already exists in the shard, so it may not make much difference.
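
For illustration, the difference is just the action setting on the elasticsearch output. Something roughly like this (host and index names are placeholders, not your actual config) would overwrite instead of returning a 409:

    elasticsearch {
      hosts       => ["localhost:9200"]
      index       => "ssl-%{+YYYY.MM}"
      document_id => "%{[@metadata][computed_id]}"
      # "index" overwrites any existing document with the same _id;
      # "create" rejects it with a 409 document_already_exists_exception
      action      => "index"
    }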

Thanks for the reply.
So it looks like the only way for me to speed this up might be to pre-process the data myself to deduplicate, because doing this at scale in ES is too expensive (e.g. slow)?

What throughput are you seeing? What is the specification of your Elasticsearch cluster? How much data do you have? How large are your shards?
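
Most of that can be pulled from the index stats and cat APIs, for example (index name taken from the 409 warning above):

    GET /ssl-2017.11/_stats/indexing,merge,refresh,store
    GET /_cat/shards/ssl-2017.11?v
    GET /_nodes/stats/os,fs,indices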

Can I pull some stats off the cluster to show the state of this indexing job?
Thanks

What does your config look like?

This is all pilot/testing based.
I have 3 servers in the cluster (normally 4 but one is de-commissioned at the moment)
4 shards (normally 1 per server)
Servers have 128GB RAM, 36TB disk, 12 cores with hyperthreading

Current index (this is the new one for testing that I am currently trying to improve) is 163GB in size with 43 million documents.

Here is some data from the elasticsearch-head plugin:

    {
      "primaries": {
        "docs": {
          "count": 45010871,
          "deleted": 0
        },
        "store": {
          "size_in_bytes": 174966759003,
          "throttle_time_in_millis": 0
        },
        "indexing": {
          "index_total": 45014542,
          "index_time_in_millis": 367659124,
          "index_current": 288889105,
          "index_failed": 0,
          "delete_total": 0,
          "delete_time_in_millis": 0,
          "delete_current": 0,
          "noop_update_total": 0,
          "is_throttled": false,
          "throttle_time_in_millis": 402
        },
        "merges": {
          "current": 1,
          "current_docs": 37787,
          "current_size_in_bytes": 130716622,
          "total": 15726,
          "total_time_in_millis": 357211935,
          "total_docs": 151162909,
          "total_size_in_bytes": 723163025880,
          "total_stopped_time_in_millis": 144355699,
          "total_throttled_time_in_millis": 98852115,
          "total_auto_throttle_in_bytes": 20971520
        },
        "refresh": {
          "total": 10286,
          "total_time_in_millis": 11816106
        },
        "flush": {
          "total": 888,
          "total_time_in_millis": 3204982
        },
        "segments": {
          "count": 171,
          "memory_in_bytes": 385503425,
          "terms_memory_in_bytes": 355067245,
          "stored_fields_memory_in_bytes": 28580864,
          "term_vectors_memory_in_bytes": 0,
          "norms_memory_in_bytes": 1002304,
          "doc_values_memory_in_bytes": 853012,
          "index_writer_memory_in_bytes": 46442972,
          "index_writer_max_memory_in_bytes": 2147483648,
          "version_map_memory_in_bytes": 617064,
          "fixed_bit_set_memory_in_bytes": 0
        },
        "translog": {
          "operations": 119174,
          "size_in_bytes": 409953585
        }
      },
      "total": {
        "docs": {
          "count": 45010871,
          "deleted": 0
        },

Based on the disk volume I would guess you are using spinning disks. SSDs might be able to handle this type of load better. In my experience indexing throughput typically drops as shards grow, so you may be better off with a larger number of smaller shards. When determining the ideal number, you naturally need to balance the requirements of query and indexing load and find your optimum.

Yes I am using spinning disks.
Happy to try increasing the number of shards just to try it out. What's a reasonable number to aim for?
I'll give it a go

My query levels are very, very low - it's just research. Ingesting/indexing is my main priority/activity.

If you are using multiple paths, I would set it so that each disk gets a shard. If you have RAIDed the storage I would increase the shard count by perhaps a factor of 4 or 6 to see if it makes a difference. If that helps you can potentially try somewhat larger numbers, but I would not go crazy, as it may have an impact on query performance.
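
A sketch of what I mean (mount points, template name and counts here are placeholders, and your real template will carry mappings as well). In elasticsearch.yml, one data path per disk:

    path.data:
      - /data/disk01
      - /data/disk02
      - /data/disk03

and a matching shard count in the index template (6.x syntax):

    PUT _template/ssl_template
    {
      "index_patterns": ["ssl-*"],
      "settings": {
        "number_of_shards": 12,
        "number_of_replicas": 1
      }
    }

Note that Elasticsearch places each whole shard copy on a single data path, chosen mainly by free disk space, so with matching numbers you get roughly one shard per disk, but it is not strictly guaranteed.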

I am using multiple paths so a shard per disk makes sense so I will give that a go. Thanks

I have 12 disks with 12 mount points. If I configure 12 shards will ES sort that out or do I need to do anything manual to have a shard per disk?

I don't think this is working any more with Logstash and ES at 6.0.
I can't see the 'document_id' being used and can only see _id in the index. Can you spot what has changed to stop this working?

    ruby {
      code => "require 'digest/md5';
      event.set( '[@metadata][computed_id]', Digest::MD5.hexdigest(event.get('[ip]') + event.get('[fingerprint_sha1]')))"
    }


      elasticsearch {
        user => xxxx
        password => xxxx
        action => "index"
        document_id => "%{[@metadata][computed_id]}"
        hosts => ["hdp13","hdp14","hdp15","hdp16"]
        index => "new_ssl-%{+YYYY.MM}"
        manage_template => false
        template_name => new_ssl1
        template_overwrite => true
      }

Why are you using the ruby filter to set this instead of the fingerprint filter?
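
For reference, a fingerprint filter equivalent might look roughly like this (an untested sketch; depending on the plugin version the MD5/SHA methods may also require a key, and because the plugin concatenates field names as well as values the resulting hashes will not match the ones produced by the ruby filter):

    fingerprint {
      # hash ip + fingerprint_sha1 into the same metadata field
      source              => ["ip", "fingerprint_sha1"]
      concatenate_sources => true
      method              => "MD5"
      target              => "[@metadata][computed_id]"
    }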

What do the ids look like when using this config?
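
One way to check is to temporarily print events with metadata included; if the sprintf reference is not resolving you will see the literal %{[@metadata][computed_id]} string, as in the 409 warning earlier in the thread. A minimal sketch:

    output {
      stdout {
        # metadata => true makes [@metadata] fields visible in the printed event
        codec => rubydebug { metadata => true }
      }
    }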