Deleting data and avoiding reindexing

(Jonathan Aaron) #1

I'm trying to figure out if Elasticsearch will remove the segments for the data I delete on its own, or do I need to delete and reindex?

Every week we index millions of documents, and every week indexing becomes slower. Our process goes something like this: a Camel job pulls down all of our mainframe's data for the week, which takes about 5 minutes. We then insert that data into Elasticsearch. We have 2 client nodes, 1 master, and 6 data nodes, with 3 indices and 24 shards per node. As of now it takes 23 hours to index 1 million documents (750 bytes each), where it used to take about 30 minutes.

I'm wondering how I can improve performance. Is Elasticsearch's garbage collection not optimized? Decreasing the shard count did not help either.



(Christian Dahlqvist) #2

Are you using time-based indices?

(Jonathan Aaron) #3

We are not. With our data model we have dept.-based indices, 3 as of now. In each dept. index we are indexing about 1 million records. We haven't started deleting data yet, but will soon, and we were worried that the deleted data will still take up space in the segments. Even without deleting, indexing is slowing to a crawl.

(Robert Cowart) #4

You can still have time-based indexes and an index per dept., e.g. an index per dept. per day. With Logstash this would be configured with something like...

index => "dept_a-%{+YYYY.MM.dd}"
index => "dept_b-%{+YYYY.MM.dd}"
index => "dept_c-%{+YYYY.MM.dd}"

Or if department name is a field in the data, something like...

index => "%{dept_name}-%{+YYYY.MM.dd}"

As far as shards go, given that you have 6 data nodes, I would probably start with something like 3 shards and 1 replica (6 shards total) per index. Of course with daily indexes that would end up internally as 6 shards total per index per day.

3 million records really isn't that much data. I have handled that much fairly well on a single 2-core/8 GB RAM VM, with plenty of room to spare. You really, really need to set up proper daily indexes.
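One way to apply those shard settings automatically is with an index template, so every new daily index picks them up as it is created. A minimal sketch, assuming a `dept_*` naming pattern (the template name, pattern, and host are illustrative):

```shell
# Sketch: every new index matching dept_* gets 3 primaries and 1 replica.
# Template name, pattern, and host are illustrative, not from this thread.
curl -XPUT 'http://localhost:9200/_template/dept_daily' \
  -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["dept_*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'
```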


(Jonathan Aaron) #5

Besides deleting data, is there any other benefit to time-based indices? Will this help with memory management within elastic?

(Aaron Mildenstein) #6

Yes! Because if you rollover indices (whether with named daily indices, or using the _rollover API), you can do a forceMerge on them, which reduces the segment count, saving you valuable resources.

There are API calls you can make to do this, or you can use Elasticsearch Curator to help automate these tasks.
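For the API route, a force merge on an old daily index might look something like this sketch (the index name and host are illustrative); only run it against indices that are no longer being written to:

```shell
# Merge each shard of the old (no-longer-written-to) index down to one segment.
# This reclaims space held by deleted documents and reduces per-shard overhead.
curl -XPOST 'http://localhost:9200/dept_a-2017.06.01/_forcemerge?max_num_segments=1'
```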

(Robert Cowart) #7

I will let the Elastic folks comment definitively on the internals. However, it basically comes down to working set size. If your working set can be contained within RAM (either the JVM heap or the OS-level cache), things are going to work A LOT faster than when the disk must be touched. This is true even with SSD storage.

A shard is a Lucene index, and as I understand it Lucene will take advantage of OS-level caching (which is why it is recommended to give half of RAM to the JVM heap and leave half for the OS). In your case your shards have probably grown so large that only a small portion fits in RAM. As new data is added and the system tries to figure out where to place it, as well as update its internal structures, the disk is being hit considerably. This is especially bad if you are using spinning disks.

Daily indexes will keep the working set smaller, ideally all in RAM, and thus much much faster.

NOTE: It sounds like you ingest data in batches. If you are merging segments as @theuntergeek mentions, make sure you do so when no ingestion batches are running. Merging segments can be I/O intensive, so it is probably best to schedule it when system load is expected to be minimal. As @theuntergeek suggests, Curator can be used to schedule such tasks for off-peak times.


(Jonathan Aaron) #8

Thanks, @theuntergeek and @rcowart. @theuntergeek, how long does it take to do a _rollover? Would I do a forceMerge before or after the _rollover? Does the forceMerge block requests? What status code does ES return if it's blocking, or how do I know when the merge is complete?

(Aaron Mildenstein) #9

A _rollover is practically instantaneous, because it creates a new index, then just points an alias at it. The data stream is always pointed at the alias, so it goes into whichever index it's pointed at.

The forceMerge would be done after the _rollover on the "old" index. When we say "blocking", it means that the current client connection cannot do anything else but wait for the forceMerge to complete. In fact, you must tell Curator to be prepared to wait a rather large number of seconds for a forceMerge to complete, via the client timeout or timeout_override options. The client will return control after the forceMerge is complete. Elasticsearch is more than happy to process other requests through other client connections while a forceMerge is happening, though any additional forceMerge requests will be blocked until the running one completes.
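To make the rollover step concrete, a call against a write alias might look like this sketch (the alias name, conditions, and host are illustrative): Elasticsearch creates a new index and repoints the alias if any condition is met, which is why it returns almost immediately.

```shell
# Roll 'dept_a-write' over to a fresh index once either condition is met.
# Alias name, host, and condition values are illustrative examples.
curl -XPOST 'http://localhost:9200/dept_a-write/_rollover' \
  -H 'Content-Type: application/json' -d '
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 1000000
  }
}'
```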

As for Curator, it simply will not move on to the next action until the current one is complete.
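The `timeout_override` mentioned above is set per action in Curator's YAML action file. A sketch of what the forcemerge action might look like, assuming Curator's YAML action-file format (the pattern, age filter, and timeout value are illustrative):

```yaml
# Sketch of a Curator action file; filter values and timeout are illustrative.
actions:
  1:
    action: forcemerge
    description: "Merge rolled-over dept indices down to 1 segment per shard"
    options:
      max_num_segments: 1
      timeout_override: 21600   # wait up to 6 hours for the merge to finish
    filters:
      - filtertype: pattern
        kind: prefix
        value: dept_
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 1
```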

(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.