Large shard size

I am setting up a new Elasticsearch cluster that will initially hold around 300 million records in a single index. The documents are academic paper data. Right now I have two nodes, so I configured the index with 2 primary shards and 1 replica. But at 75 million records I am already at 55 GB per shard (4 shards in total), which triggered Elasticsearch's 'large shard size' warning.
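
For reference, this is roughly how I'm checking the shard sizes (the index name `papers` is just what I'm calling it here):

```
# on-disk size of each shard, largest first
GET _cat/shards/papers?v&h=index,shard,prirep,store&s=store:desc
```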

From what I have read, I should resolve this. Which of these options is best?

  1. Create more primary shards while keeping the same number of nodes (I know I would have to reindex; a rough sketch of what I have in mind is below).
  2. Increase the number of nodes and add replicas (could be very expensive and somewhat wasteful).
  3. Split the index by publication year or something similar, which would leave me with ~300 indexes rather than one.
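
For option 1, this is roughly what I'm picturing (untested sketch; `papers` and `papers_v2` are just placeholder names for my current and new index): create a new index with more primaries on the same two nodes, then reindex into it.

```
# new index with more primary shards, still on the same 2 nodes
PUT papers_v2
{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}

# copy the existing documents over as a background task
POST _reindex?wait_for_completion=false
{
  "source": { "index": "papers" },
  "dest": { "index": "papers_v2" }
}
```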

I need to search across all the data for most queries. Any suggestions?

Given this is time-based data, even if it's relatively static and not high velocity, I'd put things into yearly indices.

You can have more than 1 primary shard per host too :slight_smile:
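
Roughly something like this, as a sketch rather than a recipe (the `papers-*` naming and `papers-all` alias are just examples): an index template keeps the per-year settings and alias consistent, and the alias lets you keep searching everything as one logical index.

```
# template applied to every yearly index as it is created;
# total primaries across all years can exceed the node count
PUT _index_template/papers
{
  "index_patterns": ["papers-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "aliases": {
      "papers-all": {}
    }
  }
}

# searches against the alias fan out across all the yearly indices
GET papers-all/_search
{
  "query": { "match": { "title": "example query" } }
}
```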

OK great, I'm giving that a shot to see how it performs. One more question: since my use case is search-heavy, what do you think of starting each index with 1 primary shard and 1 replica, then adding replicas if I later increase the number of nodes? That way I reduce the overhead of having multiple primary shards.
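
In other words, something like this later on if I add nodes (the `papers-*` pattern is just whatever naming I end up with):

```
# number_of_replicas is a dynamic setting, so it can be raised
# later without reindexing or changing the primary shard count
PUT papers-*/_settings
{
  "index": { "number_of_replicas": 2 }
}
```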

I guess that depends on how large the primaries end up, but it's a solid place to start.

