Impact of deletion on rewriting data

sambodhi · August 4, 2018, 11:28am

I am ingesting 40 GB of data at a time, after deleting the previous data for the same period. I have to delete first because the new version might be missing some rows. Since an update wouldn't remove the unwanted rows, I am deleting before re-ingesting. Can deleting right before ingestion cause performance issues when writing data?

Cluster details:
7 nodes x r4.2xlarge (8 vCPU, 61 GB RAM) on AWS
ES 6.0, 30 GB per node assigned to ES (not 32 GB)
210GB memory / 5TB disk space
Linux Red Hat 4.8.3-9 4.4.15-25.57.amzn1.x86_64 Java 1.8

Index size: 1.7b documents / 1.8 TB / 28 Shards

Thanks

dadoonet · August 4, 2018, 1:59pm

If you are removing data based on a given date, I'd suggest to use time based indices and just drop the unneeded indices.
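A minimal sketch of what that could look like, assuming one index per ISO week. The `events` prefix and the `prefix-YYYY.wNN` naming scheme are placeholders, not something from this thread. The helpers only build the index names and the matching `DELETE /<index>` requests; they do not talk to a cluster.

```python
from datetime import date, timedelta
from typing import List, Tuple


def weekly_index_name(prefix: str, day: date) -> str:
    """Return a weekly index name such as 'events-2018.w31' (ISO year and week)."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}-{iso_year}.w{iso_week:02d}"


def indices_for_period(prefix: str, start: date, end: date) -> List[str]:
    """List the weekly indices touched by the inclusive date range start..end."""
    names: List[str] = []
    day = start
    while day <= end:
        name = weekly_index_name(prefix, day)
        if name not in names:
            names.append(name)
        day += timedelta(days=1)
    return names


def drop_requests(prefix: str, start: date, end: date) -> List[Tuple[str, str]]:
    """Build one (method, path) pair per weekly index that covers the period."""
    return [("DELETE", f"/{name}") for name in indices_for_period(prefix, start, end)]
```

Dropping a whole index removes its files outright, so there are no deleted documents left behind to be merged away before the period is re-ingested.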

sambodhi · August 4, 2018, 3:32pm

Yes, we plan to make that change. But we already have something in production where we are facing slow writes, and I am guessing it is because of the deletion we do just before. So I am looking for a way to get this sorted for now, and we will fix it permanently by moving to time based indices.
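For reference, the delete step under discussion might look roughly like the sketch below, assuming the period is removed with a range query through the `_delete_by_query` endpoint. The index name `my-index` and the field name `event_date` are hypothetical. The helper only builds the request and does not send anything.

```python
from typing import Tuple


def delete_by_query_request(index: str, field: str, start: str, end: str) -> Tuple[str, str, dict]:
    """Build a delete-by-query request for documents with start <= field < end."""
    body = {"query": {"range": {field: {"gte": start, "lt": end}}}}
    return "POST", f"/{index}/_delete_by_query", body


# Hypothetical example: remove one week of data before re-ingesting it.
method, path, body = delete_by_query_request(
    "my-index", "event_date", "2018-07-30", "2018-08-06"
)
```

Documents removed this way are only marked as deleted inside their segments; the space is reclaimed later when those segments are merged.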

sambodhi · August 4, 2018, 6:00pm

@dadoonet thanks for your reply. Can having multiple indices impact performance? For example, we currently have around 2 TB of data in 28 primary shards. If we divide the indices by week (let's say), we will have 52 indices. Querying 6 months of data would then hit 26*5 = 130 shards, so maybe it is better to have fewer shards per index.
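The arithmetic can be sketched as below. The 5 primaries per weekly index come from the example above; the even spread of 2 TB over 52 weeks and the 50 GB per-shard target are assumptions made only for illustration.

```python
import math


def shards_queried(weeks_in_query: int, primaries_per_index: int) -> int:
    """Primary shards touched when a query spans this many weekly indices."""
    return weeks_in_query * primaries_per_index


def primaries_needed(index_size_gb: float, target_shard_gb: float) -> int:
    """Smallest primary count that keeps each shard at or below the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


# Assumption: about 2 TB spread evenly over 52 weekly indices.
weekly_size_gb = 2000 / 52
per_index = primaries_needed(weekly_size_gb, target_shard_gb=50)
six_months = shards_queried(26, per_index)
```

Under those assumptions each weekly index is a little under 40 GB, so a single primary per index would do, and a 6 month query would touch 26 primaries instead of 130.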

dadoonet · August 4, 2018, 9:44pm

Why would you keep so many shards?

sambodhi · August 5, 2018, 8:51am

We started with 7 shards, but our data was growing fast and we reached about 120 GB per shard. We started facing problems with ingestion and slow read performance when the cluster was relocating shards, etc. To keep shards at around 50 GB each, we chose 28 shards.

Ok, I realised we can't easily split by date because we use parent/child to get an exact distinct count (accurate, no approximations), which ES would not normally allow since its cardinality aggregation uses HyperLogLog with a precision threshold capped at 40,000. ES was probably the wrong choice for this: a) there was no way to do an exact distinct count at higher cardinality, so we worked around it with parent/child, which is not great; b) because of this we cannot even split the index.

So with the current problem of deleting and re-ingesting, I see a difference in indexing speed between 1) just ingesting data and 2) deleting and then ingesting. In the graph below, the index rate is consistent in case 1 and slows down in case 2.

Can you please help me understand what causes this, and whether anything can be done to sort this out for now (for example, maybe deleting the data in advance)? Thank you

sambodhi · August 5, 2018, 11:46am

The peaks before the red line are case 1 and the peaks after it are case 2.
Even after deletion has finished, ingestion is slow.

dadoonet · August 6, 2018, 9:07am

Did you run a force merge after delete operation?

sambodhi · August 6, 2018, 10:05am

Ah no, we didn't force merge. Thanks for pointing that out! Probably that leads to this overlap between deletion and ingestion. Should we call this API https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html after deletion and before re-ingesting the new data? I am guessing it could be a heavy operation to run over 2-3 TB of data.

Also, we run deletion and ingestion with refresh_interval = -1. Does that need to be reset as well, or would force merge be sufficient?

dadoonet · August 6, 2018, 12:02pm

Use the Force merge API with only_expunge_deletes so you will "just" remove the documents that need to be erased.

I'm not sure if this will help or not, but I'd give it a try.
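As a rough sketch, that call is a `POST` to `/<index>/_forcemerge` with `only_expunge_deletes=true` as a query parameter. The index name `my-index` is a placeholder, and the helper below only builds the method and path rather than sending the request.

```python
from typing import Tuple
from urllib.parse import urlencode


def forcemerge_request(index: str, only_expunge_deletes: bool = True) -> Tuple[str, str]:
    """Build the (method, path) pair for a force merge on one index."""
    params = urlencode({"only_expunge_deletes": str(only_expunge_deletes).lower()})
    return "POST", f"/{index}/_forcemerge?{params}"


# Placeholder index name; run this after the delete and before re-ingesting.
method, path = forcemerge_request("my-index")
```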

Also, we run deletion and ingestion with refresh_interval = -1. Does that need to be reset as well, or would force merge be sufficient?

When do you refresh the index then? Are you calling refresh manually?

sambodhi · August 7, 2018, 2:23pm

We have 4 ingestion jobs running weekly across 2 indices. We set refresh_interval=-1 before this set of jobs runs and reset it to refresh_interval=1s at the end.

Actually, we have tried force merge before (with only_expunge_deletes), and our cluster went into yellow state (at that time we had very big shards, holding around 220GB of data each, so that could be a reason). We ended up reindexing to a new index, which took around 2 days. From that experience we realised this is a heavy operation, and we are a little concerned about doing it every week.
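That toggle can be sketched as a context manager, so the interval is restored even if a job fails. This assumes the setting is changed through `PUT /<index>/_settings`; the `send` callable and the index name `my-index` are placeholders for whatever HTTP client and index are actually in use.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

# Hypothetical transport: any callable that takes (method, path, body).
Send = Callable[[str, str, dict], None]


def refresh_settings_body(interval: str) -> dict:
    """Settings body that sets the index refresh interval ('-1' disables refresh)."""
    return {"index": {"refresh_interval": interval}}


@contextmanager
def refresh_disabled(send: Send, index: str, restore_to: str = "1s") -> Iterator[None]:
    """Disable refresh for the duration of the block, then restore it."""
    send("PUT", f"/{index}/_settings", refresh_settings_body("-1"))
    try:
        yield
    finally:
        # Runs on success and on failure, so the index is never left at -1.
        send("PUT", f"/{index}/_settings", refresh_settings_body(restore_to))


if __name__ == "__main__":
    sent = []
    with refresh_disabled(lambda m, p, b: sent.append((m, p, b)), "my-index"):
        pass  # run the weekly ingestion jobs here
    print(sent)
```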

dadoonet · August 7, 2018, 5:07pm

Only "good" solution IMO is still:

If you are removing data based on a given date, I'd suggest to use time based indices and just drop the unneeded indices.

