I am setting up a new Elasticsearch cluster that will hold around 300 million records in a single index to start. The documents are academic paper data. Right now I have two nodes, so I set my shards as 2 primary and 1 replica. But at 75 million records I am already at 55gb per shard with 4 shards total. This caused a warning in Elasticsearch for 'large shard size'.
From what I have read I should resolve this. But what is the best of these options?
- Simply create more primary shards but keep the same number of nodes (I know I would have to re-index).
- Increase number of nodes and add replicas (could be very expensive and somewhat wasteful)
- Split the index using publication year or something like that. This would then lead to ~300 indexes rather than the one.
I need to search across all the data for most queries. Any suggestions?
