I am setting up a new Elasticsearch cluster that will initially hold around 300 million records in a single index. The documents are academic paper data. Right now I have two nodes, so I configured 2 primary shards with 1 replica (4 shards total). But at only 75 million records I am already at 55 GB per shard, which triggered Elasticsearch's 'large shard size' warning.
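For reference, this is how I have been checking shard sizes (`papers_v1` is a placeholder for my actual index name):

```
GET _cat/shards/papers_v1?v&h=index,shard,prirep,store
```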
From what I have read, I should address this. Which of the following options is best?
- Create more primary shards while keeping the same number of nodes (I know this means reindexing; I sketched what I think that looks like after this list).
- Increase the number of nodes and add replicas (could be very expensive and somewhat wasteful).
- Split the index by publication year or something similar, which would leave me with ~300 indices rather than one.
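For option 1, here is roughly what I assume the reindex would look like (`papers_v1`/`papers_v2` and the shard count are placeholders, not settled choices; 12 primaries is just a guess aimed at keeping each shard well under ~50 GB at 300 million docs):

```
# Create a new index with more primary shards
# (12 is a placeholder guess, not a tuned number)
PUT papers_v2
{
  "settings": {
    "number_of_shards": 12,
    "number_of_replicas": 1
  }
}

# Copy all documents from the old index into the new one
POST _reindex
{
  "source": { "index": "papers_v1" },
  "dest": { "index": "papers_v2" }
}
```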
I need to search across all the data for most queries.
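With option 3, I assume I could still query everything through a wildcard pattern or an alias (`papers-*` and `papers-all` are placeholder names for per-year indices like `papers-1995`):

```
# Search across every per-year index at once
GET papers-*/_search
{
  "query": {
    "match": { "title": "neural networks" }
  }
}

# Or group the per-year indices behind a single alias
POST _aliases
{
  "actions": [
    { "add": { "index": "papers-*", "alias": "papers-all" } }
  ]
}
```

But I am not sure whether fanning most queries out to ~300 indices is sensible performance-wise. Any suggestions?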