2 clusters versus 1 big cluster?

Hey all,

I have 30 machines to build my ES cluster. I am going to need to build 2000
indexes with 14400 shards total. Should I build two clusters or one?

I have a multi-customer data set. 65% of the customers have very little
data ("small customers"). The other 35% have 100x more data
("big customers"). My thought was to build one cluster where all the
"small customers" share a set of 60 "monthly" indexes, each with 100 shards.
I would build a second cluster where each "big customer" gets its own set of 60
"monthly" indexes, each with 6 shards. This configuration is so that I can
keep shard sizes below 5 GB across both clusters. The small cluster is
routed by customerId. The big cluster uses the default routing strategy.
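
For reference, the routing we have in mind looks roughly like this in Python
(a sketch only; the index, type, and field names are illustrative, and it
assumes the "requests" library against a local node):

  import requests

  ES = "http://localhost:9200"

  # Index a "small customer" document into a shared monthly index; the
  # explicit routing value overrides the default hash-of-_id routing so all
  # of that customer's documents land on the same shard.
  doc = {"customerId": "cust-123", "timestamp": "2014-03-19T09:00:00", "value": 42}
  requests.post(ES + "/shared-2014-03/event?routing=cust-123", json=doc)

  # Searches for that customer pass the same routing value, so only one
  # shard per index needs to be consulted.
  query = {"query": {"term": {"customerId": "cust-123"}}}
  requests.get(ES + "/shared-2014-03/_search?routing=cust-123", json=query)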

Questions:

  1. Should I make two clusters or just one?
  2. Do I need to keep shard sizes below 5 GB?
  3. Is management of one cluster with 2000 indexes and 14400 shards more
    difficult than two clusters, where the "small" cluster has 60 indexes and
    600 shards and the "big" cluster has 1900 indexes and 13000 shards?

Thanks,
Brad


You should look into routing instead of differently sized clusters.
A shard size of 5-10 GB is good; more can lead to increased recovery times.
Cluster management is easy with suitable configuration management - Puppet,
Chef, etc.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com


Why limit shards to 5 GB? Have you capacity tested your hardware + data and
determined that the max size is around 5 GB?

If the answer is no, I would encourage you to perform some capacity
planning first. It may be that your system can handle 15 or 20 GB per
shard, or, put a different way, 50m documents per shard. It is impossible
to know without testing, however.

So the first order of business is capacity planning. Create a single index
with a single shard, and start indexing data and running searches. When
the response time or indexing throughput becomes unsatisfactory relative to
your SLA, you've hit your limit. Write down the number of docs in the
shard, and the physical size. Now you can do some planning and extrapolate
from that value.
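
A rough sketch of that sort of test in Python, assuming a local node and the
"requests" library (the index/type names and the document generator are
stand-ins for your own):

  import json, time, requests

  ES = "http://localhost:9200"

  # One index, one shard, no replicas -- we only care about what a single
  # shard can handle.
  requests.put(ES + "/capacity-test", json={
      "settings": {"number_of_shards": 1, "number_of_replicas": 0}})

  indexed = 0
  for batch in generate_docs(batch_size=5000):    # hypothetical data generator
      # Bulk-index the batch.
      body = ""
      for doc in batch:
          body += '{"index": {}}\n' + json.dumps(doc) + "\n"
      requests.post(ES + "/capacity-test/event/_bulk", data=body)
      indexed += len(batch)

      # Run a representative query and record latency against doc count.
      start = time.time()
      requests.get(ES + "/capacity-test/_search",
                   json={"query": {"match_all": {}}})
      print(indexed, "docs, search took", round(time.time() - start, 3), "s")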

I would be wary of a cluster with so many shards, no matter how they are
partitioned. 14400 shards is a lot of wasted memory in terms of just raw
overhead. I think you may be underestimating the capacity of a single
shard, but it's hard to say without some benchmarks.

I would not necessarily be opposed to multiple clusters. Often they can
make logistics simpler, and prevent issues that crop up in extreme
multi-tenant environments like enormous cluster states that stall the
master.

-Zach


Hi Brad

I agree with what Mark and Zachary have said and will expand on their points.

Firstly, shard- and index-level operations in Elasticsearch are
peer-to-peer. Single-shard operations will affect at most 2 nodes: the node
receiving the request and the node hosting an instance of the shard
(primary or replica, depending on the operation). Multi-shard operations
(such as searches) will affect from one to (N + 1) nodes, where N is the
number of shards in the index.

So from an index/shard operation perspective there is no reason to split
into two clusters. The key issue with index/shard operations is whether the
cluster is able to handle the traffic volume. So if you do decide to split
into two clusters, you will need to look at the load profile for each
of your client types to determine how much raw processing power you need in
each cluster. It may be that a 10:20 split is more optimal than a 15:15
split between clusters to balance request traffic, and therefore CPU
utilisation, across all nodes. If you go with one cluster this is not an
issue, because you can move shards between nodes to balance the request
traffic.

Larger clusters also imply more work for the cluster master in managing the
cluster. This comes down to the number of nodes that the master has to
communicate with, and manage, and the size of the cluster state. A cluster
with 30 nodes is not too large for a master to keep track of. There will be
an increase in network traffic associated with the increase in volume of
master-to-worker and worker-to-master pings used to detect the
presence/absence of nodes. This can be offset by lengthening the respective
ping intervals (i.e. pinging less frequently).

In a large cluster it is good practice to have a group of dedicated master
nodes, say 3, from which the master is elected. These nodes do not host any
user data meaning that cluster management is not compromised by high user
request traffic.

The size of the cluster state may be more of an issue. The cluster state
comprises all of the information about the cluster configuration: it has
records for each node, index, document mapping, shard, etc. Whenever there
is a change to the cluster state, it is first made by the master, which then
sends the updated cluster state to each worker node.
Note that the entire cluster state is sent, not just the changes! It is
therefore highly desirable to limit the frequency of changes to the
cluster state, primarily by minimizing dynamic field mapping updates, and
the overall size of the cluster state, primarily by minimizing the number
of indices.
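
One concrete way to keep mapping churn down is to define the mappings up
front and disallow dynamic additions. A minimal Python sketch, assuming the
"requests" library (index, type and field names are illustrative):

  import requests

  ES = "http://localhost:9200"

  # "dynamic": "strict" rejects documents containing unmapped fields instead
  # of silently growing the mapping (and the cluster state) on every new field.
  requests.put(ES + "/shared-2014-03", json={
      "mappings": {
          "event": {
              "dynamic": "strict",
              "properties": {
                  "customerId": {"type": "string", "index": "not_analyzed"},
                  "timestamp":  {"type": "date"},
                  "value":      {"type": "long"}
              }
          }
      }
  })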

In your proposed model the size of the cluster state associated with the set
of 60 shared month indices will be larger than that of one set of 60 dedicated
month indices by virtue of having 100 shards rather than 6. However, it may
not be much bigger, because there will be much more metadata associated with
defining the index structure, notably the field mappings for all document
types in the index, than the metadata defining the shards of the index. So
it may well be that the size of the cluster state associated with 60
"shared" month indices plus N sets of 60 "dedicated" indices is not much
more than that of (N + 1) sets of 60 "dedicated" indices. So there may not
be much point in splitting into two clusters. A quick way to look at this for
your actual data model (a rough sketch in Python follows the list) is to:

  1. Set up an index in ES with mappings for all document types, 6
    shards and 0 replicas,
  2. Retrieve the index metadata JSON using the ES admin API,
  3. Increase the number of replicas to 16 (102 shards total),
  4. Retrieve the index metadata JSON again,
  5. Compare the two JSON documents from steps 2 and 4.
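
In terms of API calls, roughly (a sketch only, using the filtered cluster
state API available from ES 1.0; the index name is a placeholder and your
real mappings go where indicated):

  import requests

  ES = "http://localhost:9200"
  your_mappings = {}   # substitute the real mappings for all document types

  # Step 1: index with full mappings, 6 shards, 0 replicas.
  requests.put(ES + "/state-test", json={
      "settings": {"number_of_shards": 6, "number_of_replicas": 0},
      "mappings": your_mappings})

  # Step 2: snapshot the cluster state entries for this index.
  before = requests.get(ES + "/_cluster/state/metadata,routing_table/state-test").text

  # Step 3: bump replicas to 16 -> 6 * (1 + 16) = 102 shard copies in total.
  requests.put(ES + "/state-test/_settings",
               json={"index": {"number_of_replicas": 16}})

  # Steps 4 and 5: snapshot again and compare the sizes.
  after = requests.get(ES + "/_cluster/state/metadata,routing_table/state-test").text
  print(len(before), "bytes ->", len(after), "bytes")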

As stated above, it is desirable to minimize the number of indices. Each
shard is a Lucene index, which consumes memory and requires open file
descriptors from the OS for segment data files and Lucene index-level
files. You may find yourself running out of memory and/or file descriptors
if you are not careful.

I understand you are looking for a design that will cater for on-disk data
volume. Given that your data is split into monthly indices, it may well be
that no one index, either "shared" or "dedicated", will reach that volume in
one month. There may also be seasonal factors to consider, whereby one or
two months have much higher volumes than others. I have read/heard about
cases where a monthly index architecture was implemented but later scrapped
in favour of a single-index approach because the month-to-month variation in
volume was detrimental to overall system resource utilisation and performance.

In your case, think about whether monthly indices are really appropriate. An
alternative model is to partition one year's worth of data into a set of
indices bounded by size rather than time. In this model a new index is
started on Jan 01 and data is added to it until it reaches some predefined
size limit, at which point a new index is created to accept new data from
that point on. This is repeated until year end, at which point you might end
up with data for Jan/Feb/Mar and half of Apr in index 2014-01, the rest of
Apr plus May-Oct in index 2014-02, and Nov/Dec in index 2014-03. This way
you end up using the smallest number of indices within the constraints of
manageable shard size and overall user data volume. This may also be a
better approach than indices with 100 shards. It does, however, come at
the cost of more complexity when it comes to accessing data across multiple
indices, but this is a one-off development cost rather than an ongoing
maintenance cost.
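
The bookkeeping for that kind of size-bounded rollover is fairly small. A
sketch in Python, assuming the "requests" library (the index naming scheme,
shard counts and 30 GB threshold are just examples):

  import requests

  ES = "http://localhost:9200"
  MAX_PRIMARY_BYTES = 30 * 1024 ** 3           # example per-index size limit

  def index_name(year, seq):
      return "%d-%02d" % (year, seq)           # e.g. 2014-01, 2014-02, ...

  def maybe_rollover(year, seq):
      """Create the next index in the series once the current one is full."""
      name = index_name(year, seq)
      stats = requests.get(ES + "/" + name + "/_stats/store").json()
      size = stats["indices"][name]["primaries"]["store"]["size_in_bytes"]
      if size >= MAX_PRIMARY_BYTES:
          requests.put(ES + "/" + index_name(year, seq + 1), json={
              "settings": {"number_of_shards": 6, "number_of_replicas": 1}})
          return seq + 1                       # callers switch writes to the new index
      return seq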

Also, rather than focusing so much on shard size, look at the number and
size of the segment files comprising a shard. Remember that Lucene segments
are essentially immutable (except for marking deletes) after they are
written to disc. This means that some index management operations may
reduce down to simply copying/moving one or two segment files across the
network, rather than all segment files for a given shard. For example, when
allocating shards to nodes, Elasticsearch tries to allocate shards to nodes
that already have some of that shard's segment data. The new incremental
backup/restore in ES 1.0 also takes advantage of this segment immutability.
With this in mind you might be able to support shards > 5 GB consisting of
more segment files of bounded size rather than fewer segment files of
unbounded size (and ultimately a single 5 GB segment file).
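
The per-shard segment breakdown is easy to inspect. A small sketch, again
assuming the "requests" library and an illustrative index name:

  import requests

  ES = "http://localhost:9200"

  # List the name, size and live doc count of every Lucene segment in every
  # shard copy of the index.
  segs = requests.get(ES + "/shared-2014-03/_segments").json()
  for index, data in segs["indices"].items():
      for shard_id, copies in data["shards"].items():
          for copy in copies:
              for seg_name, seg in copy["segments"].items():
                  print(index, shard_id, seg_name,
                        seg["size_in_bytes"], seg["num_docs"])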

Hope this helps,
Regards
Mauri


@Mauri, thank you for such an interesting analysis.

Amazing! Thank you @Mauri. That is very helpful.

Our reasoning behind monthly indexes was simply that all of our queries
have a date range and a tenantID. We route by tenantID in the "shared"
monthly indexes. We are able to specify which indexes to query against and
only pass those index names to the ES API. This limits both the shards and
indexes that are queried when a single tenant asks for data within a date
range.
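
For example, a single-tenant query over a two-month range ends up looking
roughly like this (a sketch using the Python "requests" library; index and
field names are illustrative):

  import requests

  ES = "http://localhost:9200"

  # Only the monthly indexes covering the requested date range are named in
  # the URL, and the routing value narrows each of them to a single shard.
  query = {
      "query": {
          "filtered": {
              "query":  {"term": {"tenantId": "tenant-42"}},
              "filter": {"range": {"timestamp": {
                  "gte": "2014-02-01", "lt": "2014-04-01"}}}
          }
      }
  }
  requests.get(ES + "/shared-2014-02,shared-2014-03/_search?routing=tenant-42",
               json=query)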

Thanks,
Brad
