Hi All,

I am currently planning to build out to a 4-data-node Elasticsearch cluster from the 2 we have now, and I have a question about how many shards to use for the indices. I am running the ELK stack, and each daily index is currently being created with 5 shards per node. As you can imagine, this will add up to a lot of shards across the nodes over time. I have read that having too many shards is bad for the cluster's health. Is there a better way to work out a shard/replica strategy that avoids those issues but maintains redundancy? Thanks for your help.
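For reference, this is roughly how I have been checking what the daily indices get created with (a minimal sketch using the Python elasticsearch client; the index name is just an example of our logstash-YYYY.MM.DD naming, and the 1-replica default is my assumption):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Look at what one of the daily indices was actually created with.
# The index name is illustrative; ours follow the logstash-YYYY.MM.DD pattern.
settings = es.indices.get_settings(index="logstash-2014.10.18")
print(settings)
# I expect to see number_of_shards: "5" and (presumably) number_of_replicas: "1".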
The number of shards will help you scale out if you add more nodes in the future. With your current shard count of 5, you cannot optimally deploy and distribute a cluster of 6 or more nodes. However, your data is time-based, one index per day. Are queries on historical data important? I would start with a shard count of 4 per index, letting each node receive part of the index (ideally more of the index with replication), and then change the shard count if you grow the cluster. Your older indices may not be optimally distributed, but your new ones, and presumably your more important ones, will be.
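If it helps, here is a minimal sketch of how you could pin that down with an index template so every new daily index is created with 4 shards (Python client shown; the template name and the logstash-* pattern are only examples, adjust them to your index naming):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# New daily indices matching the pattern will pick up these settings.
# Existing indices keep their shard count, since the number of primary
# shards cannot be changed after an index is created.
es.indices.put_template(
    name="logstash_shards",           # example template name
    body={
        "template": "logstash-*",     # match the daily logstash-YYYY.MM.DD indices
        "settings": {
            "number_of_shards": 4,    # one primary per data node in a 4-node cluster
            "number_of_replicas": 1,  # one extra copy of each shard for redundancy
        },
    },
)

With 4 primaries and 1 replica, that is 8 shards per daily index, i.e. 2 per node on a 4-node cluster, and you can raise the shard count in the template later when you add nodes.

Cheers,
Ivan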
Thanks for the reply. So if I store data, one index per day, across 6 data nodes (4 or 5 shards on each node) for a year, that's something like 10,000 shards in the cluster. Does that make sense? And also, is this safe?
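The back-of-the-envelope math I am doing, for what it is worth (all the numbers are just my rough assumptions):

days = 365             # one index per day, retained for a year
shards_per_node = 5    # roughly 4-5 shards of each daily index end up on every node
data_nodes = 6
total_shards = days * shards_per_node * data_nodes
print(total_shards)    # 10950, which is where my "something like 10,000" comes from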
Each shard is a Lucene index, so it will consume resources at the file system level. Elasticsearch itself will be able to handle the coordination between many shards. What you need to think about next is how much data each shard actually holds. Distributed logging can produce large volumes of logs, perhaps too much for a 4-node cluster.
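A quick way to sanity-check that is to look at how much data each shard actually holds on disk; the cat shards API reports per-shard document counts and store sizes. A sketch with the Python client (the index pattern is just an example):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# One line per shard: index, shard number, primary/replica, doc count,
# store size on disk, and the node it lives on. Handy for spotting daily
# indices whose shards are tiny (too many shards) or oversized (too few).
print(es.cat.shards(index="logstash-*", v=True,
                    h="index,shard,prirep,docs,store,node"))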