Upper limits on indexes/shards in a cluster

Hi,

We've been using Elasticsearch on AWS for two purposes: as a search engine for
user-created documents, and as a cache for the activity feeds in our
application. We decided early on to treat each customer's content as a
distinct index, for full logical separation of customer data. We have about
three hundred indexes in our cluster, each with the default 5-shard/1-replica
setup.

Recently, we've had major problems with the cluster "locking up": it stops
responding to requests and loses track of its nodes. We initially responded by
removing possible CPU and memory limits and by placing all nodes in the same
AWS placement group to maximize inter-node bandwidth, all to no avail. We
eventually lost an entire production cluster, which forced us to split the
indexes across two completely independent clusters, each taking half of the
indexes, with application-level logic determining which cluster holds which
index.

All that is to say: with our setup, are we running into an undocumented
practical limit on the number of indexes or shards in a cluster? It works out
to around 3000 shards (300 indexes x 5 primaries x 2 copies). Our logs show
evidence of nodes timing out their responses to massive shard status checks,
and it gets worse the more nodes there are in the cluster; the cluster is
stable with only two nodes.

Thanks,
-David
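
A quick way to confirm a shard count like that is the _cat API; a minimal
sketch, assuming an ES 1.x node reachable on localhost:9200 (the _cat API
ships with 1.0 and later):

    # prints one line per shard copy the cluster tracks; ~3000 expected here
    curl -s 'localhost:9200/_cat/shards' | wc -l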

I might also note: the sizes of these indexes vary wildly, some being just a
few documents, some thousands, more or less following a power law.

How many nodes do you have in your cluster?

Have you checked if your nodes run out of file descriptors or heap memory?

Jörg
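
Both are easy to check from the node stats API; a minimal sketch, assuming a
locally reachable 1.x node:

    # per-node open file descriptor counts and JVM heap usage
    curl -s 'localhost:9200/_nodes/stats/process,jvm?pretty'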

The unhealthy clusters had four or five nodes. We switched to two two-node
clusters, and those have been stable.

Bigdesk reports plentiful headroom on file descriptors, memory, and CPU in
all cases.

Two nodes are not stable with regard to split brain.

All I can guess is that two nodes carry a small volume of network traffic, and
that you may have had network problems.

Without exact diagnostic messages it's hard to understand why nodes
disconnected. There are plenty of reasons; networking is just one.

ES has no internal shard limit beyond what the memory/CPU/network limits of
the hardware (or VM) impose. That does not mean you can put an arbitrary
number of shards or an arbitrary volume of data on a single machine. It all
depends.

Jörg
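
The usual guard against split brain on the 1.x line is requiring a quorum of
master-eligible nodes via discovery.zen.minimum_master_nodes; a minimal
sketch, assuming all five nodes are master-eligible and that the setting is
applied dynamically through the cluster settings API:

    # quorum for 5 master-eligible nodes: 5/2 + 1 = 3
    curl -XPUT 'localhost:9200/_cluster/settings' -d '
    {
      "persistent": { "discovery.zen.minimum_master_nodes": 3 }
    }'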

example log line:

    [DEBUG][action.admin.indices.status] [Red Ronin] [index][1], node[t60FJtJ-Qk-dQNrxyg8faA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@36239161]
    org.elasticsearch.transport.NodeDisconnectedException: [Shotgun][inet[/IP:9300]][indices/status/s] disconnected

When the cluster gets into this state, all requests hang, waiting for...
something to happen; the cluster even stops responding to Bigdesk requests,
yet each individual node returns 200 when curled locally. A huge number of
copies of the log line above appear at the end of this process -- one for
every single shard on the node -- flooding my logs. As soon as a node is
restarted, the cluster "snaps back": it immediately fails the outstanding
requests and begins rebalancing.
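
When a cluster wedges like this, it can help to capture what the nodes are
actually busy with at that moment; a minimal sketch using the hot-threads API,
assuming at least one node is still answering HTTP:

    # dumps the busiest threads on every node; run while requests are hanging
    curl -s 'localhost:9200/_nodes/hot_threads'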

I should also note that I've been using OpenJDK. I'm currently in the process
of moving to the official Oracle binaries; are there specific optimization
changes there that help with intra-cluster IO? There are some hints at that in
this very old GitHub/Elasticsearch interview:
http://exploringelasticsearch.com/github_interview.html

This has nothing to do with OpenJDK.

IndicesStatusRequest (deprecated, and to be removed in a future version) is a
heavy request; something on your machines may be taking longer than 5 seconds,
so the request times out.

The IndicesStatus action uses Lucene's Directories.estimateSize. This call can
take some time on large directories; maybe you have many
segments/unoptimized shards/indices.

Jörg
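
If the goal is just "is the cluster responsive", the health endpoint is a far
cheaper probe than _status; a minimal sketch, assuming a locally reachable
node:

    # master-level summary (status, node count, shard counts); no per-shard disk walk
    curl -s 'localhost:9200/_cluster/health?pretty'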

Maybe you are hit by
https://github.com/elasticsearch/elasticsearch/issues/7385

Jörg

Hmm, maybe. We are using the Elastica PHP library and call
getStatus()->getServerStatus() relatively often (to work around Elastica's
lack of proper error handling for unreachable nodes) to determine whether we
have a node we can connect to. If that call maps to an IndicesStatusRequest in
the end, we might be shooting ourselves in the foot.
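
A reachability probe doesn't need _status at all; the bare HTTP root (the same
thing "returns 200 when curled locally" exercises) is enough. A minimal
sketch:

    # 200 means the node's HTTP layer is up; touches no shard statistics
    curl -s -o /dev/null -w '%{http_code}\n' 'localhost:9200/'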

Yep, it turns out that call runs _status against the entire cluster every
time. That might get... uncomfortable. We're submitting a bug report to
Elastica to, at the very least, get them to update their documentation and
mark that code path as deprecated.
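
For anyone stuck on that code path in the meantime, the 1.x status API also
accepts an index scope, so the call needn't sweep all ~300 indexes; a minimal
sketch, where the index name is a placeholder:

    # limits the heavy status call to a single index's shards
    curl -s 'localhost:9200/customer_index/_status'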
