Hi,
we are currently planning a cluster that takes in about 1TB of data per year. With a standard shard count of 5, this would mean that each shard is around 200GB after one year. As far as I know, shards should only be around 50-70GB at most. So we have 2 options:
Increase the number of shards to 20 or more.
Create a new index every month or so.
Questions:
How can I configure Logstash so that it uses the right index when a new index is created? Zero downtime is required, which is why using one index seems much easier.
What creates the new index, and where is that configured?
I can access all indices at the same time with an alias, right?
An important question here is how long you need to keep the data. By far the most efficient way to delete data in Elasticsearch is to delete complete indices, and this is one of the main reasons why time-based indices are used. If you have a single index, you need to remove old data using delete-by-query, which is much less efficient and puts a much higher load on your system.
I would recommend using time-based indices. You can either use rollover to create new indices based on a combination of size and/or age, or just make Logstash create indices with fixed periods by specifying a date pattern for the index name. You can see an example of this here; just leave the dd portion of the date out to create a monthly index.
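To make the monthly naming concrete, here is a rough Python sketch of how a date pattern maps an event timestamp to an index name, plus the shard-size arithmetic for the numbers in the question. It assumes a Logstash index setting along the lines of `logstash-%{+YYYY.MM}` evaluated against the event timestamp in UTC, 1TB taken as 1000GB, and evenly spread data. The function names and the `logstash` prefix are made up for illustration; this is not Logstash code.

```python
from datetime import datetime, timedelta, timezone


def monthly_index_name(event_time: datetime, prefix: str = "logstash") -> str:
    """Return the monthly index name for an event (hypothetical helper).

    The timestamp is converted to UTC first, so an event just after
    midnight in a local zone ahead of UTC can still land in the previous
    month's index.
    """
    utc_time = event_time.astimezone(timezone.utc)
    return f"{prefix}-{utc_time:%Y.%m}"


def shard_size_gb(yearly_gb: float, indices_per_year: int, primary_shards: int) -> float:
    """Rough size of one primary shard, assuming data is spread evenly."""
    return yearly_gb / indices_per_year / primary_shards


if __name__ == "__main__":
    # 01:00 on 1 April in a UTC+2 zone is still 31 March in UTC.
    local_event = datetime(2019, 4, 1, 1, 0, tzinfo=timezone(timedelta(hours=2)))
    print(monthly_index_name(local_event))        # logstash-2019.03

    # One index per year vs. one index per month, 5 primary shards each.
    print(round(shard_size_gb(1000, 1, 5), 1))    # 200.0
    print(round(shard_size_gb(1000, 12, 5), 1))   # 16.7
```

With monthly indices and 5 primary shards, each shard stays well below the 50-70GB range mentioned in the question, so fewer primary shards per index would probably also be fine.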
Irrespective of whether you use rollover or time-based indices based on the index name, you can use ILM (index lifecycle management) to manage the rollover (if applicable) and retention.
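As an illustration of what such a policy could look like, the sketch below builds an ILM policy body in Python and prints it as JSON. The thresholds (roll over at 250GB of total primary data or after 30 days, delete 365 days after rollover) are assumptions picked for the example rather than recommendations, and `build_ilm_policy` is a hypothetical helper name. The printed body is the kind of document you would send with `PUT _ilm/policy/<policy-name>`.

```python
import json


def build_ilm_policy(max_size: str = "250gb",
                     max_age: str = "30d",
                     delete_after: str = "365d") -> dict:
    """Build an ILM policy body with a rollover and a delete phase.

    All thresholds are illustrative assumptions:
    - max_size is the total size of all primary shards in the index
      (250gb corresponds to 5 primaries of roughly 50gb each).
    - max_age rolls over to a new index even if the size is not reached.
    - delete_after is the minimum age, counted from rollover, before the
      whole index is deleted.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_size": max_size, "max_age": max_age}
                    }
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(), indent=2))
```

Because the delete phase removes complete indices, retention stays cheap compared with delete-by-query on a single large index.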