Lots of deleted documents above 40%

The bug that I mentioned was the one from Ryan Ernst's reply.
Yes, I did big bulks (3,000-10,000 items per request). You said this happened because of the 20-second refresh and the big bulks. Do you mean the disk space that is still used by deleted documents, or something else?
Thanks

Sorry. I was answering on my mobile and I missed his reply.

It should be a combination of both: the bug I referred to and the fact that you are doing large bulk deletes with a high refresh interval. This is likely producing large segments with a high count of deleted documents, and, as explained earlier, segments like that take quite some time to merge.
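
If you want to verify this on your cluster, the cat segments API shows per-segment live and deleted document counts (the index name below is just a placeholder):

```
# List each segment of the index with its live and deleted document counts
# ("my_index" is a placeholder for your index name)
curl -XGET 'localhost:9200/_cat/segments/my_index?v'
```

The `docs.deleted` column should make it obvious whether a handful of large segments are carrying most of the deletes.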

Thanks,
I'm trying to understand what refresh interval is good for us. Are there any rules of thumb for this? You said that 20 seconds is too long, so what is a good value? In our case the index doesn't need to be real-time at all.

Thanks

I didn't say that 20 seconds is too long. What I meant is that for this specific case (where you did a massive delete and a bug may be playing a role), 20 seconds does not seem to be a good interval.

There is no rule of thumb. You just need to understand that the longer the interval, the longer it will take for changes to be visible in search in exchange for a slight performance boost when indexing.
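
For reference, `refresh_interval` is a dynamic index setting, so you can change it at any time without reindexing (the index name below is just a placeholder):

```
# Relax the refresh interval while doing heavy bulk work...
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index": { "refresh_interval": "20s" }
}'

# ...and set it back to the default (1s) afterwards
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index": { "refresh_interval": "1s" }
}'
```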

So why is 20 seconds not a good choice? My index really doesn't need to be real-time, and even a few minutes would be fine for me. What is the other price of a long refresh interval?

Thanks

You can keep 20 seconds. Just remember to set a smaller interval and use smaller bulk sizes when doing mass deletes, to avoid this situation.
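
For example, instead of one request with thousands of items, send many small bulks like this (index, type and ids are placeholders; on 1.x a `_type` is still required, and the body must end with a newline):

```
# A small bulk containing only delete actions
curl -XPOST 'localhost:9200/_bulk' --data-binary '{ "delete": { "_index": "my_index", "_type": "doc", "_id": "1" } }
{ "delete": { "_index": "my_index", "_type": "doc", "_id": "2" } }
{ "delete": { "_index": "my_index", "_type": "doc", "_id": "3" } }
'
```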

Thanks. By the way, how do smaller bulks help to avoid this situation?

With smaller bulks I would expect the deletes to get spread across smaller segments, avoiding the huge-segment issue that could be caused by the bug.

But isn't a segment "closed" when it reaches a certain size? So with large bulks or small bulks, isn't there almost the same chance of the documents ending up in one segment?

We are actually talking about bulk deletes, right? So on second thought, for deletes it doesn't really matter, as it depends on which segment each document is in.

If you are going to delete documents, you should certainly avoid large bulk indexing. The more you index into a segment, the worse it is going to be when you delete those documents and expect the huge segments to merge at some point in the future.

We don't do any large bulk indexing, just batches of no more than 50-100 very small documents. But let's say I have documents that have been in the index for some time now (more than a year), so there is a good chance that they have been merged into one or a few segments. So I was thinking that it really doesn't matter, because in the end Lucene tries to merge everything into a few segments, and then when I delete (bulk or not), I'll have a few huge segments with a lot of deleted data. Correct? So now only God knows when they will be cleared from the segments?

Yes, that may indeed happen. I think you understand what the core issue is. That's why it's better to have separate indices (maybe time-based, or maybe per type) so you can efficiently manage those indices separately.
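
To make the benefit concrete: with separate, for example time-based, indices, getting rid of old data becomes a cheap whole-index delete instead of a mass document delete, so no deleted documents are left behind in big segments (the index name here is hypothetical):

```
# Dropping a whole time-based index removes its files outright,
# with no tombstones waiting for a merge to reclaim the space
curl -XDELETE 'localhost:9200/events-2015.06'
```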

Mass deletes will always be a problem. It should be better in the latest versions, but it is certainly more problematic in the earlier 1.x versions.

It is indeed a problem with deletions, as I see it now, and the older the documents are, the bigger the problem is.
I think it's a conceptual problem. I'm not sure about more recent versions, but in 1.7.x it is.
In our case the indices can't be time-based or separate, as we have many customers and creating an index per customer is a very expensive approach. So after discussing with Shay a long time ago, we decided to go with one index but many aliases, each customer having its own alias. In theory I can create a separate index for a huge customer, but for small customers a separate index is a problem. A time-based approach also doesn't work, as I don't know how long a customer will stay with us, ideally forever :slight_smile:. Looks like a dead end :slight_smile:
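
To be concrete, our setup is roughly one shared index with a filtered alias (plus routing) per customer, something like this (index, alias and field names here are just illustrative):

```
# One shared index, one filtered alias per customer
curl -XPOST 'localhost:9200/_aliases' -d '{
  "actions": [
    { "add": {
        "index":   "shared_index",
        "alias":   "customer_42",
        "filter":  { "term": { "customer_id": "42" } },
        "routing": "42"
    } }
  ]
}'
```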

There may be a solution, but it would require deeper analysis, the sort of thing that is only possible with a support subscription.

According to the ES support pricing model, a full reindex would be significantly cheaper :slight_smile:
We tried a few times to subscribe to support, and it was really expensive, not affordable :frowning:

Hi Salva_G,

If there is no way to divide your data into multiple indices, then you can try dividing your customers' data into multiple smaller clusters. This will help a lot with both availability and performance, though it does require you to build some sort of broker service that tells you which customer's data is in which cluster.
Hope it helps,
Jigar

Thanks Jigar,
But your solution requires reindexing, and this is my concern.
Thanks

Or just use one cluster with multiple indices and spread the customers between them, e.g. 4 indices x 20 shards each. This way you won't need the broker service, just another field/value telling you which index each customer's data lives in. Still requires reindexing, though.
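
A sketch of that layout, with made-up index names and sizing:

```
# Create 4 indices with 20 primary shards each; customers are then
# assigned to one of them by the application
for i in 1 2 3 4; do
  curl -XPUT "localhost:9200/customers_$i" -d '{
    "settings": {
      "number_of_shards": 20,
      "number_of_replicas": 1
    }
  }'
done
```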

Thanks, this architecture may be the best one, but again, the issue here is conceptual: deleting documents that were indexed a long time ago is problematic in Lucene, because Lucene merges them into a few segments, and over time those become big segments that are rarely merged again. So, as I understand it, deleting documents and then reclaiming the space allocated to them is not well defined (at least in Lucene 4.10).

Thanks

Can you at least upgrade? Maybe it will fix things (it will at least make them faster).