Break up a large index into multiple smaller, equally sized indexes?

What's the best way to break up a large index into multiple smaller, equally sized indexes?
I've looked at using reindex with a query but it seems extremely slow:

    curl -XPOST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
    {
      "source": {
        "index": "$INDEX",
        "query": {
          "range": {
            "@timestamp": {
              "format": "date_time",
              "gte": "2021-01-24T00:00:00.000Z",
              "lt":  "2021-01-24T13:00:00.000Z"
            }
          }
        }
      },
      "dest": {
        "index": "$NEWINDEX"
      }
    }'

I just want to break the index up into smaller chunks, doesn't have to be based on date, could be on size, say 5 equal pieces.

You could use a new ILM policy (see "ILM: Manage the index lifecycle", Elasticsearch Reference [7.11]) and then reindex into that.

Hello @stevedwray, one option to tackle the slowness while reindexing is to disable the replicas of the new index and to throttle the requests.
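
As a sketch (the `$NEWINDEX` name is a placeholder, and `refresh_interval: -1` is an additional common companion tweak, not strictly required): replicas are disabled via the destination index's settings, and throttling is controlled by the `requests_per_second` parameter on the reindex request itself (`-1` means unthrottled):

    curl -XPUT "localhost:9200/$NEWINDEX/_settings" -H 'Content-Type: application/json' -d'
    {
      "index": {
        "number_of_replicas": 0,
        "refresh_interval": "-1"
      }
    }'

Then pass `requests_per_second` on the `_reindex` call, e.g. `POST /_reindex?requests_per_second=500`. Once the reindex is done, restore the replicas and refresh interval.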

I have looked at ILM and do use it. However, this is a one-off where there's a historical index that I need to split up.

I do that already.
I noticed that when I run a reindex with the above settings, it takes an extremely long time before any documents are written. I think it's slowly working its way through the large index until it finds documents that match the query, and then reindexes those. As I'd need to repeat this operation several times, it would take far too long.
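
One way to see what a running reindex is actually doing (a sketch; assumes the default port) is the tasks API, which reports per-task progress counters:

    curl -XGET "localhost:9200/_tasks?detailed=true&actions=*reindex"

The `status` section of each matching task includes `total`, `created`, `updated` and `batches`, so you can tell whether documents are being written or the scan is still in progress.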

So I'm wondering if there's an 'official' way to break up indexes into smaller indexes.

You can still use it for that.

Nope, you're free to choose your own approach.

Let me rephrase that: "a best-practice way to break up indexes into smaller indexes".

The answer there is still the same :slight_smile:

So the best way to do it is using ILM for this one-off job? Is that how you'd approach it?
And there is no 'best practice'? People just do it however they figure it out?


That's how I would do it, yes.
It allows you to define and automatically create the indices based on the size you want. It also makes sure they are easily queryable via an alias.

I'm trying it out, created an ILM and attached it to this index. I'll see how it turns out in the morning!
Thanks for the tip.
It does seem a bit nicer than running lots of curl API calls with queries etc.


What is the reason for breaking it up? If the shards are too large and affecting performance you can simply use the split index API to increase the number of primary shards. Querying a single index with X shards is not much different to querying X indices with 1 primary shard each from a performance perspective.
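
For illustration, a split along those lines might look like this (a sketch: `$INDEX`/`$NEWINDEX` are placeholders, the source index must first be made read-only, and the target's primary shard count must be a multiple of the source's):

    curl -XPUT "localhost:9200/$INDEX/_settings" -H 'Content-Type: application/json' -d'
    {
      "index.blocks.write": true
    }'

    curl -XPOST "localhost:9200/$INDEX/_split/$NEWINDEX" -H 'Content-Type: application/json' -d'
    {
      "settings": {
        "index.number_of_shards": 10
      }
    }'

The split is a hard-link-based operation on the segment files, so it is far faster than reindexing the same documents.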

The index contains packetbeat data and was allowed to grow too large without ILM rotating it. A lot of it is flow data, which is not needed, but the index covers a period that has valuable historical data.
I've been trying to reindex it into another index without the flow data, but it goes incredibly slowly: after 4 days it hadn't done a quarter of it. I've done the same with other indexes that were about 100G and they went very fast, taking hours. I reasoned that if I first break this large index down into indexes of about 100G each, it could go faster: divide and conquer. I have limited time to get this done before I move on to another job, and I'd like to finish before then.

I made an ILM policy with a hot phase set to roll over at a maximum index size of 50G and a maximum age of 1 day, and attached this index to it. This seems to be set up fine; the index shows as being managed by this policy.
But it doesn't appear to have kicked in and done anything with the index. Do I need to manually fire it off?

Manually fire what off? The reindex - yes. The ILM policy just handles index rotation, not the reindex.

So if I've attached the ILM policy to an index and told it to rotate the index when it exceeds 50GB, should that break it down into indexes of 50GB each?

Yep.

It hasn't done that, so I must have got something wrong.
When I run

    GET packetbeat-7.10.0-2021.01.20-000099/_ilm/explain

I get:

    {
      "indices" : {
        "packetbeat-7.10.0-2021.01.20-000099" : {
          "index" : "packetbeat-7.10.0-2021.01.20-000099",
          "managed" : true,
          "policy" : "split_large_index",
          "lifecycle_date_millis" : 1611716731954,
          "age" : "35.81d",
          "phase" : "warm",
          "phase_time_millis" : 1611716735321,
          "action" : "complete",
          "action_time_millis" : 1611716735493,
          "step" : "complete",
          "step_time_millis" : 1611716735493,
          "phase_execution" : {
            "policy" : "split_large_index",
            "phase_definition" : {
              "min_age" : "0d",
              "actions" : { }
            },
            "version" : 3,
            "modified_date_in_millis" : 1614800913433
          }
        }
      }
    }

Can you share the policy and how you attached it to the index?