Break up large index into multiple smaller equally sized indexes?

stevedwray · March 1, 2021, 3:25am

What's the best way to break up a large index into multiple smaller, equally sized indexes?
I've looked at using reindex with a query but it seems extremely slow:

    POST /_reindex
    {
      "source": {
        "index": "$INDEX",
        "query": {
          "range": {
            "@timestamp": {
              "format": "date_time",
              "gte": "2021-01-24T00:00:00.000Z",
              "lt":  "2021-01-24T13:00:00.000Z"
            }
          }
        }
      },
      "dest": {
        "index": "$NEWINDEX"
      }
    }

I just want to break the index up into smaller chunks. It doesn't have to be based on date; it could be based on size, say 5 equal pieces.
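A minimal sketch of generating the same kind of `_reindex` body for several contiguous time windows, assuming the index's time span is known up front. The function names and the destination naming scheme are hypothetical. Equal time windows only produce equally sized indexes if the event rate is roughly even, and this does not by itself make each reindex faster.

```python
from datetime import datetime, timezone


def iso_millis(ts):
    """Format a timezone-aware datetime like 2021-01-24T00:00:00.000Z."""
    utc = ts.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def window_reindex_bodies(source, dest_prefix, start, end, pieces):
    """Build one _reindex body per equal-width time window in [start, end)."""
    step = (end - start) / pieces
    bodies = []
    for i in range(pieces):
        lo = start + step * i
        # Use the exact end for the last window so rounding never drops documents.
        hi = end if i == pieces - 1 else start + step * (i + 1)
        bodies.append({
            "source": {
                "index": source,
                "query": {"range": {"@timestamp": {
                    "format": "date_time",
                    "gte": iso_millis(lo),
                    "lt": iso_millis(hi),
                }}},
            },
            # Hypothetical naming scheme: <prefix>-001, <prefix>-002, ...
            "dest": {"index": f"{dest_prefix}-{i + 1:03d}"},
        })
    return bodies
```

Each returned body can be sent to `POST /_reindex` in turn. Because every window uses `gte` on its start and `lt` on its end, the windows do not overlap.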

warkolm · March 1, 2021, 3:33am

You could use a new ILM policy (see "ILM: Manage the index lifecycle" in the Elasticsearch Reference [7.11]) and then reindex into that.

Eduardo_Iglesias · March 1, 2021, 4:42pm

Hello @stevedwray, one option to tackle the slowness while reindexing is to disable replicas on the new index and to throttle the requests.
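As a rough sketch of what that could look like: the settings bodies below would go to `PUT /<new-index>/_settings`, and the helper builds the `_reindex` path with the `requests_per_second`, `slices` and `wait_for_completion` parameters. The helper name, the `refresh_interval` tweak and the restored replica count of 1 are assumptions added for illustration, not something from this thread.

```python
from urllib.parse import urlencode

# Settings to apply to the destination index before the copy.
# number_of_replicas=0 is the suggestion above; refresh_interval=-1 is an
# extra bulk-load tweak that is an assumption, not part of the suggestion.
BEFORE_REINDEX = {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}

# Settings to restore afterwards. The replica count of 1 is a placeholder;
# None serialises to JSON null, which resets refresh_interval to its default.
AFTER_REINDEX = {"index": {"number_of_replicas": 1, "refresh_interval": None}}


def reindex_path(requests_per_second=-1, slices="auto"):
    """Build the _reindex path with throttling and slicing parameters.

    requests_per_second=-1 disables throttling; a positive number caps the rate.
    """
    params = {
        "wait_for_completion": "false",  # run as a background task
        "requests_per_second": requests_per_second,
        "slices": slices,
    }
    return "/_reindex?" + urlencode(params)
```

With `wait_for_completion=false` the request returns a task ID straight away, so progress can be followed through the tasks API instead of holding a connection open.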

stevedwray · March 1, 2021, 7:11pm

I have looked at ILM and do use it. However, this is a one-off where there's a historical index that I need to split up.

stevedwray · March 1, 2021, 7:13pm

I do that already.
I noticed that when I am running a reindex with the above settings it takes an extremely long time before any documents are written. I'm thinking that it's slowly working its way through the large index until it finds documents that match and then reindexes those. As I'd need to repeat this operation several times, it would take far too long.

So I'm wondering if there's an 'official' way to break up indexes into smaller indexes.

warkolm · March 2, 2021, 2:00am

You can still use it for that.

warkolm · March 2, 2021, 2:00am

Nope, you're free to choose your own approach

stevedwray · March 2, 2021, 2:02am

Let me rephrase that; "A best-practise way to break up indexes into smaller indexes".

warkolm · March 2, 2021, 2:09am

The answer there is still the same

stevedwray · March 2, 2021, 2:11am

So the best way to do it is using ILM for this one-off job? Is that how you'd approach it?
And there is no 'best practice', just people do it however they figure it out?

warkolm · March 2, 2021, 2:22am

That's how I would do it, yes.
It allows you to define and automatically create the indices based on the size you want. It also makes sure they are easily queryable via an alias.
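A minimal sketch of the three request bodies this approach would need, assuming a size-based rollover. The function names are hypothetical, and the policy name, alias and size are whatever you choose. Rollover only decides where new writes go, so the old index still has to be reindexed through the alias. An index template carrying the same two lifecycle settings would also be needed so that rolled-over indices stay managed.

```python
def rollover_policy(max_size="50gb"):
    """Body for PUT _ilm/policy/<name>: roll over once the index reaches max_size."""
    return {"policy": {"phases": {"hot": {"actions": {"rollover": {"max_size": max_size}}}}}}


def bootstrap_index(alias, policy):
    """Body for PUT <alias>-000001: first managed index, marked as the write index."""
    return {
        "settings": {
            "index.lifecycle.name": policy,
            "index.lifecycle.rollover_alias": alias,
        },
        "aliases": {alias: {"is_write_index": True}},
    }


def reindex_into_alias(source, alias):
    """Body for POST _reindex: write through the alias so rollover picks the target."""
    return {"source": {"index": source}, "dest": {"index": alias}}
```

ILM checks rollover conditions periodically (`indices.lifecycle.poll_interval`, 10 minutes by default), so during a fast reindex the resulting indices may overshoot the configured size somewhat.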

stevedwray · March 2, 2021, 2:23am

I'm trying it out: I created an ILM policy and attached it to this index. I'll see how it turns out in the morning!
Thanks for the tip.
It does seem a bit nicer than running lots of curl API calls with queries etc.

Christian_Dahlqvist · March 2, 2021, 7:32am

What is the reason for breaking it up? If the shards are too large and affecting performance you can simply use the split index API to increase the number of primary shards. Querying a single index with X shards is not much different to querying X indices with 1 primary shard each from a performance perspective.
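A sketch of the two requests a split would involve, under the assumption that the source index can be made read-only for the duration. The function name is hypothetical. The check below only covers the rule that the target shard count must be a multiple of the source shard count; which factors are actually allowed also depends on the index's `index.number_of_routing_shards`.

```python
def split_requests(source, target, source_shards, target_shards):
    """Return (method, path, body) tuples for splitting source into target."""
    if target_shards <= source_shards or target_shards % source_shards != 0:
        raise ValueError("target shard count must be a larger multiple of the source shard count")
    return [
        # The source index must block writes before it can be split.
        ("PUT", f"/{source}/_settings", {"settings": {"index.blocks.write": True}}),
        # The split creates a new index with more primary shards.
        ("POST", f"/{source}/_split/{target}", {"settings": {"index.number_of_shards": target_shards}}),
    ]
```

Note that this yields one new index with more, smaller shards; it does not produce several separate indexes.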

stevedwray · March 3, 2021, 7:38pm

The index contains Packetbeat data and was allowed to grow too large without ILM rotating it. A lot of that data is flow data, which is not needed. The index covers a period that's got valuable historical data.
I've been trying to reindex it into another index without the flow data but it goes incredibly slowly, as in 4 days later it hasn't done a quarter. I've done the same with other indexes that were about 100G and they went very fast, taking hours. I reasoned that if I first break this large index down into indexes of 100G or so each, it could go faster. Divide and conquer. I have limited time to get this done before I move on to another job and I'd like to get it done before then.
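For reference, a sketch of a reindex body that leaves the flow documents behind. It assumes flow events can be identified by a `type` field with the value `flow`; that field and value, and the function name, are assumptions and should be checked against a sample document first.

```python
def reindex_without_flows(source, dest, field="type", value="flow"):
    """Body for POST _reindex that skips documents where field == value."""
    return {
        "source": {
            "index": source,
            # must_not excludes every document matching the term query.
            "query": {"bool": {"must_not": [{"term": {field: value}}]}},
        },
        "dest": {"index": dest},
    }
```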

stevedwray · March 3, 2021, 7:50pm

I made an ILM policy with a hot phase that rolls over at a maximum index size of 50G or a maximum age of 1 day, and attached this index to it. That seems to be set fine: the index shows as being managed by this policy.
But it doesn't appear to have kicked in and done anything with the index. Do I need to manually fire it off?

warkolm · March 3, 2021, 10:24pm

Manually fire what off? The reindex - yes. The ILM policy just handles index rotation, not the reindex.

stevedwray · March 3, 2021, 10:40pm

So if I've attached the ILM to an index, and told it to rotate the index when it exceeds 50GB, should that break it down into indexes of 50GB each?

warkolm · March 3, 2021, 10:41pm

Yep.

stevedwray · March 3, 2021, 10:43pm

It hasn't done that so I must have got something wrong.
When I run
    GET packetbeat-7.10.0-2021.01.20-000099/_ilm/explain
I get:

    {
      "indices" : {
        "packetbeat-7.10.0-2021.01.20-000099" : {
          "index" : "packetbeat-7.10.0-2021.01.20-000099",
          "managed" : true,
          "policy" : "split_large_index",
          "lifecycle_date_millis" : 1611716731954,
          "age" : "35.81d",
          "phase" : "warm",
          "phase_time_millis" : 1611716735321,
          "action" : "complete",
          "action_time_millis" : 1611716735493,
          "step" : "complete",
          "step_time_millis" : 1611716735493,
          "phase_execution" : {
            "policy" : "split_large_index",
            "phase_definition" : {
              "min_age" : "0d",
              "actions" : { }
            },
            "version" : 3,
            "modified_date_in_millis" : 1614800913433
          }
        }
      }
    }
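One rough way to read that output, as a sketch: an index that ILM is about to roll over normally sits in the `hot` phase, on the `rollover` action, at the `check-rollover-ready` step. The helper below (a hypothetical name) checks for exactly that, using the fields copied from the output above.

```python
def waiting_for_rollover(explain_entry):
    """Return True if ILM explain shows the index parked on the rollover check."""
    return (
        explain_entry.get("phase") == "hot"
        and explain_entry.get("action") == "rollover"
        and explain_entry.get("step") == "check-rollover-ready"
    )


# Fields copied from the explain output above.
entry = {
    "managed": True,
    "policy": "split_large_index",
    "phase": "warm",
    "action": "complete",
    "step": "complete",
}
```

For the entry above the check comes back false: the index is in the `warm` phase with its action already `complete`, so no rollover step is pending for it.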

warkolm · March 3, 2021, 10:43pm

Can you share the policy and how you attached it to the index?
