Elastic hard delete


(Nilesh) #1

Hello,

I have created an index of 200M docs that will be updated frequently: 50M docs will be updated in a month. In short, the index is both read-heavy and write-heavy.
As I understand it, Elasticsearch/Lucene does not update a document in place; it deletes the old version and adds a new one. That means the deleted docs still reside in the index but are not searchable.
Lucene occasionally merges segments according to the merge policy, which is costly.

So my questions are:

  1. How well will my index perform for reads and writes in such a scenario?
  2. Is there any alternative way to deal with such a scenario?
  3. Can we disable soft deletes in Elasticsearch/Lucene and allow only hard deletes?

Thanks,
Nilesh


(Christian Dahlqvist) #2

How often is frequently? Which version of Elasticsearch are you using?

Correct. They will only be physically deleted from disk during a merge.

  1. It depends on the use case and the questions I asked earlier.
  2. I am not sure I understand your question.
  3. No, that is not possible.


(Nilesh) #3

Version: 5.4
In a month:

  1. For 7 days: bulk update of 30-35M docs (from data source 1)
  2. Daily: bulk update of 1M docs (from data sources 2..n)
  3. Plus heavy search


(Nik Everett) #4

The only way to avoid the mark-as-deleted-then-merge-away behavior is to create an entirely new index and then drop the old one. Some people do this, but mostly because they don't have triggers to sync changes, so they must do it periodically.

I ran a system that had many updates. Performance was fine. The standard advice of setting the refresh interval to 30 seconds, if you can tolerate it, is good here. So is watching for repeated updates to the same document and deduplicating them.
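
As a rough illustration of those two suggestions (a sketch only, not taken from the system described above), the Python below keeps only the last update per document id before building a bulk request body, and shows the settings body that would be sent with `PUT <index>/_settings` to set a 30 second refresh interval. It assumes updates arrive as (id, document) pairs. The function names, the index name `master` and the type name `doc` are hypothetical; `_type` is included because 5.x expects a type in the bulk action line.

```python
import json

# Body for PUT <index>/_settings: refresh every 30 seconds.
REFRESH_SETTINGS = {"index": {"refresh_interval": "30s"}}


def dedupe_updates(updates):
    """Collapse a batch of (doc_id, doc) pairs so each id appears once.

    Later updates overwrite earlier ones, so only the final version of
    each document is sent to the cluster.
    """
    latest = {}
    for doc_id, doc in updates:
        latest[doc_id] = doc
    return latest


def to_bulk_body(index, doc_type, latest):
    """Build a newline-delimited bulk body: one action line plus one source line per doc."""
    lines = []
    for doc_id, doc in latest.items():
        # Hypothetical index/type names are passed in by the caller.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    batch = [
        ("1", {"name": "a", "rev": 1}),
        ("2", {"name": "b", "rev": 1}),
        ("1", {"name": "a", "rev": 2}),  # supersedes the first update to id "1"
    ]
    latest = dedupe_updates(batch)
    print(json.dumps(REFRESH_SETTINGS))
    print(to_bulk_body("master", "doc", latest), end="")
```

Deduplicating before sending means each document is deleted and re-added once per batch instead of once per incoming change, which leaves fewer deleted docs for merges to clean up.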


(Christian Dahlqvist) #5

What is the problem you are trying to solve?


(Nilesh) #6

We are building a system that takes updates from multiple data sources and applies them to the master data. At the same time, consumers of the master data should be able to get the updated data in real time.


(Christian Dahlqvist) #7

Are you seeing any performance problems you are trying to address?


(Nilesh) #8

I did a POC using the following configuration:
3-node cluster
RAM: 60 GB
Disk: 1.5 TB SSD
CPU: 8 cores

POC:
Indexing and searching (fuzzy queries) simultaneously in a multi-threaded environment.
What I observed is that indexing slows down drastically. If I do only indexing, the performance is very good.


(Christian Dahlqvist) #9

What was limiting performance when you indexed and searched at the same time? CPU? Disk I/O? Did you see any reports in the logs about long or slow GC?


(Nilesh) #10

How did you handle deleted docs?
Were you relying on Elasticsearch/Lucene to purge them?
What settings did you use?


(Nilesh) #11

Using htop, I found that CPU and RAM were the limiting factors.
Even if I scale up the infrastructure, won't purging deleted docs still affect performance?


(Christian Dahlqvist) #12

Indexing, merging and querying use the same system resources, so of course they can affect each other's performance. You need to test that you have enough system resources available for the combined load and that, under that load, you can still index and query with acceptable performance.
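
One number that may be worth watching during such a test is how many deleted docs are still waiting to be merged away. The sketch below assumes you have already fetched and parsed the JSON from `GET <index>/_stats/docs` and reads the `docs.count` and `docs.deleted` fields for primaries; the function name and the sample numbers are made up for illustration.

```python
def deleted_ratio(stats):
    """Return the fraction of docs in primary shards that are marked deleted.

    `stats` is the parsed JSON response of GET <index>/_stats/docs.
    """
    docs = stats["_all"]["primaries"]["docs"]
    live = docs["count"]
    deleted = docs["deleted"]
    total = live + deleted
    # Guard against an empty index.
    return deleted / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical response fragment, not measured on a real cluster.
    sample = {"_all": {"primaries": {"docs": {"count": 200_000_000, "deleted": 50_000_000}}}}
    print(f"{deleted_ratio(sample):.0%}")  # prints 20%
```

If this ratio keeps growing while indexing throughput drops, merges are probably not keeping up with the update rate.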


(Nilesh) #13

Thanks Christian_Dahlqvist & nik9000!