Setting up elasticsearch to scale: shards per index

Matt_7 · February 10, 2014, 7:23pm

Hey all!

Background: I am using elasticsearch with logstash to do some log analysis.
My use-case is write-heavy, and I have configured ES accordingly. After
experimenting with different setups, I am considering the following
implementation:

- Separate log processing from the ES cluster
- 1x Logstash server
- 2x ES servers (1x master, 1x data-only), each with:
    - 17GB memory
    - a single ES node with 9GB allocated memory

This should be plenty of memory for the relatively small dataset I am
starting with, and I will expand it as needed. However, I have the
following questions/concerns:

It is my understanding that, ideally, we want one shard per index per node
(plus an additional replica shard per primary shard per node assuming
number of replicas is set to 1), meaning in this setup, I would set number
of shards per index to 2. Each index is, as of now, relatively small
(~500MB), so two shards should be fine. However, as we scale the project,
the indexes will grow, and we will eventually want to split them into more
shards. On the hardware side, the ES servers are relatively lightweight.
As we scale, we have the option to simply beef up the hardware. Finally,
my understanding is that increasing the number of shards/index down the
line requires a full reindexing of the data, which I would like to avoid.

It seems to me that I would be better off setting shards/index to 4, in
anticipation of future scaling. Are there costs to this that I am missing?
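
To make the arithmetic behind this question concrete, here is a minimal sketch. It assumes two data nodes and one replica, as in the setup described above; the helper names are hypothetical and only illustrate the counting, not any Elasticsearch API. The `index_settings` dictionary uses the standard `number_of_shards` and `number_of_replicas` index settings.

```python
def shard_copies(primaries: int, replicas: int) -> int:
    """Total shard copies for one index: every primary plus its replicas."""
    return primaries * (1 + replicas)


def copies_per_data_node(primaries: int, replicas: int, data_nodes: int) -> float:
    """Average number of shard copies of one index held by each data node."""
    return shard_copies(primaries, replicas) / data_nodes


def primary_size_mb(index_size_mb: float, primaries: int) -> float:
    """Rough size of one primary shard, assuming documents spread evenly."""
    return index_size_mb / primaries


# Settings body for creating an index with 4 primaries and 1 replica.
index_settings = {"settings": {"number_of_shards": 4, "number_of_replicas": 1}}

if __name__ == "__main__":
    # Compare the two options discussed above for a ~500MB index.
    for primaries in (2, 4):
        print(
            primaries,
            "primaries:",
            shard_copies(primaries, replicas=1),
            "copies in total,",
            copies_per_data_node(primaries, replicas=1, data_nodes=2),
            "per data node,",
            primary_size_mb(500, primaries),
            "MB per primary",
        )
```

With 4 primaries each node carries twice as many (smaller) shards per index; since every shard is its own Lucene index, that per-shard overhead is the main cost to weigh against the flexibility.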

What about starting off with a single ES node on a beefier server? Should I
be concerned about availability with a single-node cluster (no replicas)?

Thanks for reading

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6c988e28-58dc-4af7-86a9-16d763ce4ff7%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

nik9000 · February 10, 2014, 7:30pm

If you have an index per (some time period), then you can create the new indexes with more shards when you have more hardware and leave the old ones with their old number. You can also allocate more shards than nodes in preparation for getting more nodes. I believe those are the standard tactics for Logstash.
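
As a rough sketch of that tactic, assuming daily indexes named in Logstash's default `logstash-YYYY.MM.dd` pattern: the shard schedule and the helper names below are hypothetical, chosen only to show that each new index can pick up a different shard count while older indexes keep theirs.

```python
from datetime import date

# Hypothetical schedule: (first day the setting applies, primary shards).
# Indexes created before a change keep the shard count they were created with.
SHARD_SCHEDULE = [
    (date(2014, 2, 1), 2),   # initial small cluster
    (date(2014, 6, 1), 4),   # assumed date when more hardware is added
]


def index_name(day: date, prefix: str = "logstash") -> str:
    """Name of the daily index for the given day."""
    return f"{prefix}-{day:%Y.%m.%d}"


def shards_for(day: date) -> int:
    """Primary shard count for an index created on the given day."""
    chosen = SHARD_SCHEDULE[0][1]  # days before the schedule use the first entry
    for start, shards in SHARD_SCHEDULE:
        if day >= start:
            chosen = shards
    return chosen


def creation_settings(day: date, replicas: int = 1) -> dict:
    """Settings body to use when creating that day's index."""
    return {
        "settings": {
            "number_of_shards": shards_for(day),
            "number_of_replicas": replicas,
        }
    }


if __name__ == "__main__":
    for day in (date(2014, 2, 10), date(2014, 7, 1)):
        print(index_name(day), creation_settings(day))
```

No reindexing is involved here: only indexes created after the schedule changes get the larger shard count.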

jprante · February 10, 2014, 8:32pm

Do not put a single-node cluster (no replicas) into production. Never use
single nodes except for development and demos. Always use replicas and at
least 3 nodes with minimum master nodes = 2 to avoid split-brain in
production.
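
The "minimum master nodes = 2" figure follows from the usual majority rule, sketched below. The function names are hypothetical; in Elasticsearch releases of this era the value is applied through the `discovery.zen.minimum_master_nodes` setting.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum: more than half of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_master_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a quorum remains."""
    return master_eligible - minimum_master_nodes(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(
            n,
            "master-eligible nodes: quorum",
            minimum_master_nodes(n),
            "tolerates",
            tolerated_master_failures(n),
            "failure(s)",
        )
```

With two master-eligible nodes the quorum is also 2, so losing either node leaves no quorum; three is the smallest count that survives one failure.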

Having 17G RAM on a master-only server is more than enough of a beefy
server if you ask me.

Jörg

Matt_Fulton · February 10, 2014, 9:32pm

Thanks for the response! Following your response, I read up on the split
brain problem and am moving to a 3-node cluster with one master-only (on
logstash server), one master/data, and one data-only.

Matt_7 · February 10, 2014, 9:35pm

Thanks for the responses! After reading up on the split brain problem, I
am moving to a three-node cluster with one master-only (on logstash
server), one master/data, and one data-only server.

jprante · February 11, 2014, 7:58am

That is a misconception. To avoid split-brain, it is a good idea to set up
master-only (data-less) nodes so that they cannot run into heap problems.
But that is not all: more important is to have at least three
master-eligible nodes, in case you get network disruptions. Additionally,
you should set up at least two data nodes; otherwise your data is not
protected against loss if one server fails. Replica level 1 requires two
data nodes.
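
The two requirements above can be checked against a planned layout with a small sketch like the following. The `Node` class and `check_layout` function are hypothetical names used for illustration and are not part of Elasticsearch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    master: bool  # master-eligible
    data: bool    # holds shards


def check_layout(nodes, replicas: int = 1) -> list:
    """Return a list of problems with a planned layout (empty if none found)."""
    problems = []
    master_eligible = sum(1 for n in nodes if n.master)
    data_nodes = sum(1 for n in nodes if n.data)
    if master_eligible < 3:
        problems.append(
            f"only {master_eligible} master-eligible node(s); at least 3 are needed"
        )
    # A replica is never placed on the same node as its primary.
    if data_nodes < replicas + 1:
        problems.append(
            f"only {data_nodes} data node(s); {replicas} replica(s) need {replicas + 1}"
        )
    return problems


if __name__ == "__main__":
    # The three-node layout proposed earlier in the thread.
    proposed = [
        Node("logstash-host", master=True, data=False),
        Node("es-1", master=True, data=True),
        Node("es-2", master=False, data=True),
    ]
    print(check_layout(proposed))
```

For the proposed three-node layout this reports the missing third master-eligible node, while the two data nodes satisfy replica level 1.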

Jörg

Matt_7 · February 11, 2014, 2:55pm

Hey thanks for clarifying! I actually ended up setting it up as 1x
master-only, 2x master-eligible data-nodes, realizing that I would need 3
eligible masters while putting it all together.

On the heap problems, could you be more specific about what you are
referring to, or maybe point me towards a resource where I could learn more
about this?

If I am understanding this correctly, the minimum size cluster you are
suggesting is 5 nodes, with 3 master-only nodes and 2 data-only nodes.
Thinking ahead, each additional data node will require an additional
master-only node.

Thanks for taking time to help out!

jprante · February 11, 2014, 4:05pm

Three master nodes are enough, for as many data nodes as you wish to add.
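
A minimal sketch of why that holds, assuming the `node.master`, `node.data` and `discovery.zen.minimum_master_nodes` settings of Elasticsearch releases of this era: the node names and the `cluster_settings` helper are hypothetical. The quorum depends only on the number of master-eligible nodes, so it does not change as data nodes are added.

```python
def cluster_settings(data_nodes: int, dedicated_masters: int = 3) -> list:
    """Per-node settings for dedicated masters plus data-only nodes."""
    # Majority of the master-eligible nodes; data-only nodes do not count.
    quorum = dedicated_masters // 2 + 1
    nodes = []
    for i in range(1, dedicated_masters + 1):
        nodes.append({
            "node.name": f"master-{i}",  # hypothetical name
            "node.master": True,
            "node.data": False,
            "discovery.zen.minimum_master_nodes": quorum,
        })
    for i in range(1, data_nodes + 1):
        nodes.append({
            "node.name": f"data-{i}",  # hypothetical name
            "node.master": False,
            "node.data": True,
            "discovery.zen.minimum_master_nodes": quorum,
        })
    return nodes


if __name__ == "__main__":
    for count in (2, 10):
        settings = cluster_settings(data_nodes=count)
        quorums = {s["discovery.zen.minimum_master_nodes"] for s in settings}
        print(count, "data nodes:", len(settings), "nodes in total, quorum", quorums)
```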

You can search this mailing list for discussions where kimchy explained
"dedicated master nodes" and how they help in split-brain situations.

For example:

https://groups.google.com/forum/#!topic/elasticsearch/dxjpMd4vNXQ

Jörg

Matt_7 · February 12, 2014, 2:45pm

Thanks for the help, Jörg! To make this a base that is easy to scale from,
we have moved to a 5-node cluster with 3 dedicated masters and 2 data
nodes.

Matt
