Massive index compression

How can I easily compress 140 indices (some of which are up to 100 GB)?
I have tried

  1. closing the indices,
  2. changing the codec to 'best_compression',
  3. and then executing a forcemerge on the index.
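
For reference, the steps above correspond to four REST calls, because the index has to be reopened before it can be merged. The following is only a sketch that builds the request list and prints it without contacting a cluster. The function name `codec_change_requests` and the index name `logs-2023.01` are made up for illustration; the endpoints are the standard close, update settings, open, and force merge APIs.

```python
from typing import Optional

# Each request is (HTTP method, path, JSON body or None).
Request = tuple[str, str, Optional[dict]]


def codec_change_requests(index: str) -> list[Request]:
    """Build the calls that switch one index to best_compression.

    index.codec is a static index setting, so it is changed while the
    index is closed.
    """
    return [
        ("POST", f"/{index}/_close", None),
        ("PUT", f"/{index}/_settings", {"index": {"codec": "best_compression"}}),
        ("POST", f"/{index}/_open", None),
        # Ask for a merge down to one segment; only segments that are
        # actually rewritten pick up the new codec.
        ("POST", f"/{index}/_forcemerge?max_num_segments=1", None),
    ]


# "logs-2023.01" is a hypothetical index name.
for method, path, body in codec_change_requests("logs-2023.01"):
    print(method, path, body)
```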

The forcemerge task finishes very quickly, and the index does not change its size.

However, if I manually reindex, the index is compressed by up to 60%.

Is there a way to do this in bulk? I don't mind having to do it with a Python script. Thank you!

Note: I wrote a script that reindexes the indices one by one, but some reindex operations take more than 10 hours, and sequencing them with such long delays is complex. I have also tried running several reindex operations in parallel. What I would like is for the cluster itself to manage this parallelization and sequencing.

Please don't ping folks to draw them into a conversation like that, especially not after just a few minutes. It's very rude and violates the community code of conduct. We're all just volunteers here.

OK, sorry David! I didn't know that.

Forcemerging should be quicker than reindexing so I would recommend doing what you described. The only reason the forcemerge task would finish very quickly is if the index has already been forcemerged down to a single segment. If you are doing this as part of an index lifecycle policy I would recommend changing the codec there as well.

To force a forcemerge I think you need to increase the number of segments in the shards. You can do this by indexing a dummy document with a known ID and then immediately deleting it. If your indices have more than 1 primary shard you may need to index and delete multiple documents so you know all shards have more than 1 segment.
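
One way to check whether every primary shard has picked up an extra segment is to inspect the output of the index segments API (`GET /<index>/_segments`). The sketch below works on an already parsed response and assumes the usual layout of that response (`indices.<name>.shards.<n>` as a list of shard copies, each with `routing.primary` and a `segments` map). The helper name `single_segment_shards` and the sample data are invented for illustration.

```python
def single_segment_shards(segments_response: dict, index: str) -> list[str]:
    """Return the primary shard numbers that hold at most one segment.

    These are the shards for which a forcemerge to one segment is
    likely to be a no-op.
    """
    shards = segments_response["indices"][index]["shards"]
    result = []
    for shard_id, copies in shards.items():
        for copy in copies:
            if copy["routing"]["primary"] and len(copy["segments"]) <= 1:
                result.append(shard_id)
    return sorted(result, key=int)


# Hand-written sample, trimmed to the fields the helper reads.
sample = {
    "indices": {
        "my-index": {
            "shards": {
                "0": [{"routing": {"primary": True}, "segments": {"_0": {}, "_1": {}}}],
                "1": [{"routing": {"primary": True}, "segments": {"_0": {}}}],
            }
        }
    }
}

print(single_segment_shards(sample, "my-index"))  # prints ['1']
```

If the list is not empty, indexing and deleting a few more dummy documents and checking again is one option.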

Thanks @Christian_Dahlqvist! The thing is, these indices are in the last stage of the ILM policy, so the forcemerge has already been done. Because of that, nothing happens even when I try to forcemerge again.

What can I do to apply this in bulk? Can I run something like a reindex in batch mode?

Any reindexing you will need to manage yourself, e.g. using a script. Why not add and delete a document to the index and then perform another forcemerge down to a single segment as I suggested instead of reindexing?

OK, so this can work? Just to clarify:

  1. Add a document.
  2. Forcemerge to 1 segment.

Yes, that should work. If you do not want the additional document to pollute the index you can also remove it before running the forcemerge. I am not sure whether you also need to run a refresh.
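
Applied to many indices, the add, delete, and forcemerge sequence could be scripted roughly as follows. This is a sketch under several assumptions: an unsecured cluster reachable over plain HTTP (no authentication or TLS handling), a made-up document ID `forcemerge-dummy` and field `dummy`, and a refresh added purely as a precaution since it is unclear whether one is needed. The function names are hypothetical. The indices are processed one after another, and each forcemerge call blocks until the merge finishes.

```python
import json
import urllib.request
from typing import Callable, Iterable, Optional

# A sender takes (method, path, body) and returns the parsed JSON reply.
Sender = Callable[[str, str, Optional[dict]], dict]


def http_sender(base_url: str) -> Sender:
    """Return a function that sends one JSON request to the cluster."""
    def send(method: str, path: str, body: Optional[dict] = None) -> dict:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            base_url + path, data=data, method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return send


def remerge(index: str, send: Sender, refresh: bool = True) -> dict:
    """Add and delete a dummy document, then forcemerge to one segment."""
    doc = f"/{index}/_doc/forcemerge-dummy"  # hypothetical document ID
    send("PUT", doc, {"dummy": True})
    if refresh:
        # Precaution only; it is an open question whether this is needed.
        send("POST", f"/{index}/_refresh", None)
    send("DELETE", doc, None)
    return send("POST", f"/{index}/_forcemerge?max_num_segments=1", None)


def remerge_all(indices: Iterable[str], send: Sender) -> dict:
    """Process indices sequentially; return {index: error} for failures."""
    failed = {}
    for index in indices:
        try:
            remerge(index, send)
        except Exception as exc:  # keep going with the remaining indices
            failed[index] = str(exc)
    return failed
```

A call might look like `remerge_all(["index-a", "index-b"], http_sender("http://localhost:9200"))`, where both the index names and the URL are placeholders.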

It worked perfectly! Thanks!

Now I have my 140 indices in the force merge queue! But I have 3 servers with enough resources to run more than one force merge at the same time.

I tried to increase the size of the force_merge thread pool:

PUT _cluster/settings
{
  "persistent": {
    "thread_pool.force_merge.size": 5
  }
}

But I get: persistent setting [thread_pool.force_merge.size], not dynamically updateable.

Is there any way to do this without restarting the servers? (It is a production environment!)

No, I do not think so. I would however recommend not changing this. Forcemerging is primarily disk I/O intensive rather than very taxing on CPU and RAM/heap.