Using my own document_id - is there a faster way?

Hi Everyone,
For years I have been using my own document_id to de-duplicate my data. The principle is that I use a Ruby filter to create a unique hash of the IP address and the SHA fingerprint, and use this as the document_id. When I try to add to the index it will only add NEW ip+sha data, and so de-duplicates.
This was discussed in this old thread:

Works just fine but it is very slow because it checks every item before indexing. Does anyone have any better ideas on how to do this faster? A better design? A better workflow?

Actual code (updated for Logstash 6.0):

    ruby {
      code => "require 'digest/md5';
        event.set('[@metadata][computed_id]', Digest::MD5.hexdigest(event.get('[ip]') + event.get('[fingerprint_sha1]')))"
    }

What kind of data are you indexing?

Parsed SSL certificates in JSON.

I don't think I have any ideas on how to make it faster. If you had a timestamp associated with the event you could create a fast and efficient identifier, but from the linked thread it seems you don't. Smaller shards and faster disks may help, but that is all I can think of right now.

I do have a timestamp, but I can't see how to use that to deduplicate.
What I want to achieve is to say "I have this certificate from this IP already, so don't add it again".
It's the (IP+SHA) combination that is the unique identifier used to avoid duplication.

Here is a thought: Could I overwrite an existing entry within an index? If I get another identical (IP+SHA) could I simply add it with an overwrite so that there is only ever one copy? In other words ignore the timestamp and assume it was the same document?

This would save space and would stop me from having to check every single document before adding?

Does that make any sense?

I incorrectly assumed that was what you were doing as it is the most common way to handle it.

I'm not sure that is what I am doing. I think I have it set up to only allow one document, so it has to check every single time to make sure it's unique, rather than just skipping the check and overwriting.
Here is my warning if I turn on debug:

{"create"=>
  {"_index"=>"ssl-2017.11",
  "_type"=>"ssl",
  "_id"=>"%{[@metadata][computed_id]}",
  "status"=>409,
  "error"=>{
    "type"=>"document_already_exists_exception",
    "reason"=>"[ssl][%{[@metadata][computed_id]}]: document already exists",
    "shard"=>"2", "index"=>"ssl-2017.11"}
  } 
}

Rather than generate that warning, which implies it is checking for a unique doc_id, could I skip the check and just overwrite? As I am not doing that now, what would I configure to allow that?

If you set the action to create, you will get a failure. If you instead set it to index, it will overwrite. In both cases Elasticsearch has to check whether the document already exists in the shard, so it may not make much difference.
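
For illustration, the difference is just the action setting on the elasticsearch output. Something roughly like this (host and index names are placeholders, not your actual config) would overwrite instead of returning a 409:

    elasticsearch {
      hosts       => ["localhost:9200"]
      index       => "ssl-%{+YYYY.MM}"
      document_id => "%{[@metadata][computed_id]}"
      # "index" overwrites any existing document with the same _id;
      # "create" rejects it with a 409 document_already_exists_exception
      action      => "index"
    }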

Thanks for the reply.
So it looks like the only way for me to speed this up might be to pre-process the data myself to deduplicate, because doing this at scale in ES is too expensive (e.g. slow)?

What throughput are you seeing? What is the specification of your Elasticsearch cluster? How much data do you have? How large are your shards?
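
Most of that can be pulled from the index stats and cat APIs, for example (index name taken from the 409 warning above):

    GET /ssl-2017.11/_stats/indexing,merge,refresh,store
    GET /_cat/shards/ssl-2017.11?v
    GET /_nodes/stats/os,fs,indices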

Can I pull some stats off the cluster to show the state of this indexing job?
Thanks

What does your config look like?

This is all pilot/testing based.
I have 3 servers in the cluster (normally 4 but one is de-commissioned at the moment)
4 shards (normally 1 per server)
Servers have 128GB RAM, 36TB disk, 12 cores with hyperthreading

Current index (this is the new one for testing that I am currently trying to improve) is 163GB in size with 43 million documents.

Here is some data from the elasticsearch-head plugin:

    {
      "primaries": {
        "docs": {
          "count": 45010871,
          "deleted": 0
        },
        "store": {
          "size_in_bytes": 174966759003,
          "throttle_time_in_millis": 0
        },
        "indexing": {
          "index_total": 45014542,
          "index_time_in_millis": 367659124,
          "index_current": 288889105,
          "index_failed": 0,
          "delete_total": 0,
          "delete_time_in_millis": 0,
          "delete_current": 0,
          "noop_update_total": 0,
          "is_throttled": false,
          "throttle_time_in_millis": 402
        },
        "merges": {
          "current": 1,
          "current_docs": 37787,
          "current_size_in_bytes": 130716622,
          "total": 15726,
          "total_time_in_millis": 357211935,
          "total_docs": 151162909,
          "total_size_in_bytes": 723163025880,
          "total_stopped_time_in_millis": 144355699,
          "total_throttled_time_in_millis": 98852115,
          "total_auto_throttle_in_bytes": 20971520
        },
        "refresh": {
          "total": 10286,
          "total_time_in_millis": 11816106
        },
        "flush": {
          "total": 888,
          "total_time_in_millis": 3204982
        },
        "segments": {
          "count": 171,
          "memory_in_bytes": 385503425,
          "terms_memory_in_bytes": 355067245,
          "stored_fields_memory_in_bytes": 28580864,
          "term_vectors_memory_in_bytes": 0,
          "norms_memory_in_bytes": 1002304,
          "doc_values_memory_in_bytes": 853012,
          "index_writer_memory_in_bytes": 46442972,
          "index_writer_max_memory_in_bytes": 2147483648,
          "version_map_memory_in_bytes": 617064,
          "fixed_bit_set_memory_in_bytes": 0
        },
        "translog": {
          "operations": 119174,
          "size_in_bytes": 409953585
        }
      },
      "total": {
        "docs": {
          "count": 45010871,
          "deleted": 0
        },

Based on the disk volume I would guess you are using spinning disks. SSDs might be able to handle this type of load better. In my experience indexing throughput typically drops as shards grow, so you may be better off with a larger number of smaller shards. When determining the ideal number, you naturally need to balance the requirements of query and indexing load and find your optimum.

Yes I am using spinning disks.
Happy to try increasing the number of shards just to try it out. What's a reasonable number to aim for?
I'll give it a go

My query levels are very, very low - it's just research. Ingesting/indexing is my main priority/activity.

If you are using multiple paths, I would set it so that each disk gets a shard. If you have RAIDed the storage I would increase the shard count by perhaps a factor of 4 or 6 to see if it makes a difference. If that helps you can potentially try somewhat larger numbers, but I would not go crazy, as it may have an impact on query performance.
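
A sketch of what I mean (mount points, template name and counts here are placeholders, and your real template will carry mappings as well). In elasticsearch.yml, one data path per disk:

    path.data:
      - /data/disk01
      - /data/disk02
      - /data/disk03

and a matching shard count in the index template (6.x syntax):

    PUT _template/ssl_template
    {
      "index_patterns": ["ssl-*"],
      "settings": {
        "number_of_shards": 12,
        "number_of_replicas": 1
      }
    }

Note that Elasticsearch places each whole shard copy on a single data path, chosen mainly by free disk space, so with matching numbers you get roughly one shard per disk, but it is not strictly guaranteed.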

I am using multiple paths so a shard per disk makes sense so I will give that a go. Thanks

I have 12 disks with 12 mount points. If I configure 12 shards will ES sort that out or do I need to do anything manual to have a shard per disk?

I don't think this is working any more with Logstash and ES at 6.0.
I can't see the 'document_id' being used and can only see _id in the index. Can you spot what has changed to stop this working?

    ruby {
      code => "require 'digest/md5';
      event.set( '[@metadata][computed_id]', Digest::MD5.hexdigest(event.get('[ip]') + event.get('[fingerprint_sha1]')))"
    }


      elasticsearch {
        user => xxxx
        password => xxxx
        action => "index"
        document_id => "%{[@metadata][computed_id]}"
        hosts => ["hdp13","hdp14","hdp15","hdp16"]
        index => "new_ssl-%{+YYYY.MM}"
        manage_template => false
        template_name => new_ssl1
        template_overwrite => true
      }

Why are you using the ruby filter to set this instead of the fingerprint filter?
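
For reference, a fingerprint filter equivalent might look roughly like this (an untested sketch; depending on the plugin version the MD5/SHA methods may also require a key, and because the plugin concatenates field names as well as values the resulting hashes will not match the ones produced by the ruby filter):

    fingerprint {
      # hash ip + fingerprint_sha1 into the same metadata field
      source              => ["ip", "fingerprint_sha1"]
      concatenate_sources => true
      method              => "MD5"
      target              => "[@metadata][computed_id]"
    }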

What do the ids look like when using this config?
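
One way to check is to temporarily print events with metadata included; if the sprintf reference is not resolving you will see the literal %{[@metadata][computed_id]} string, as in the 409 warning earlier in the thread. A minimal sketch:

    output {
      stdout {
        # metadata => true makes [@metadata] fields visible in the printed event
        codec => rubydebug { metadata => true }
      }
    }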