Startup issues with ES 1.3.5

Gurvinder_Singh · December 26, 2014, 3:54pm

Do you have the master and data nodes separate, or are they running in
the same ES node process? Also, after running jstack, does the process
become responsive again, or does it still remain out of the cluster?

On 12/26/2014 04:43 PM, Chris Moore wrote:

I tried your configuration suggestions, but the behavior was no
different. I have attached the jstack output from the troubled node
(master). It didn't appear to indicate anything of note, but I have
attached it.

On Thursday, December 25, 2014 8:33:20 AM UTC-5, Gurvinder Singh wrote:

We may have faced a similar problem with ES 1.3.6. The cause we found
may have been concurrent merges. These settings helped us fix the
issue:
merge:
    policy:
      max_merge_at_once: 5
      reclaim_deletes_weight: 4.0
      segments_per_tier: 5
indices:
  store:
    throttle:
      max_bytes_per_sec: 40mb # as we have a few SATA disks for storage
      type: merge

You can check your hung process by attaching jstack to it:

jstack -F <pid>

Also, once you detach jstack, the process starts responding again and
rejoins the cluster. This should not happen at all, though: even if disk
is the limitation, ES should not stop responding.

- Gurvinder
On 12/24/2014 08:00 PM, Mark Walkom wrote:
> Ok a few things that don't make sense to me;
>
> 1. 10 indexes of only ~220Kb? Are you sure of this?
> 2. If so why not just one index?
> 3. Is baseball_data.json the data for an entire index? If not can
>    you clarify.
> 4. What java version are you on?
> 5. What monitoring were you using?
> 6. Can you delete all your data, switch monitoring on, start
>    reindexing and then watch what happens? Marvel would be ideal
>    for this.
>
> What you are seeing is really, really weird. That is a high shard
> count however given the dataset is small I wouldn't think it'd
> cause problems (but I could be wrong).
>
> On 25 December 2014 at 02:27, Chris Moore <cmo...@perceivant.com>
> wrote:
>
> Attached is the script we've been using to load the data and the
> dataset. This is the mapping and a sample document
>
> {
>   "baseball_1" : {
>     "mappings" : {
>       "team" : {
>         "properties" : {
>           "L" : { "type" : "integer", "store" : true },
>           "W" : { "type" : "integer", "store" : true },
>           "name" : { "type" : "string", "store" : true },
>           "teamID" : { "type" : "string", "store" : true },
>           "yearID" : { "type" : "string", "store" : true }
>         }
>       }
>     }
>   }
> }
>
> {"yearID":"1871", "teamID":"PH1", "W":"21", "L":"7",
> "name":"Philadelphia Athletics"}
>
> On Wednesday, December 24, 2014 10:22:00 AM UTC-5, Chris Moore
> wrote:
>
> We tried many different test setups yesterday. The first setup we
> tried was:
>
> 1 Master, 2 Data nodes
> 38 indices
> 10 shards per index
> 1 replica per index
> 760 total shards (380 primary, 760 total)
> Each index had 2,745 documents
> Each index was 218.9kb in size (according to the _cat/indices API)
>
> We realize that 10 shards per index with only 2 nodes is not a good
> idea, so we changed that and reran the tests.
>
> We changed shards per index to the default of 5 and put 100 indices
> on the 2 boxes and ran into the same issue. It was the same
> dataset, so all other size information is correct.
>
> After that, we turned off one of the data nodes, set replicas to 0
> and shards per index to 1. With the same dataset, I loaded ~440
> indices and ran into the timeout issues with the Master and Data
> nodes just idling.
>
> This is just a test dataset that we came up with to quickly test
> our issues that contains no confidential information. Once we
> figure out the issues affecting this test dataset, we'll try things
> with our real dataset.
>
>
> All of this works fine on ES 1.1.2, but not on 1.3.x (1.3.5 is our
> current test version). We have also tried our real setup on 1.4.1
> to no avail.
>
>
> On Tuesday, December 23, 2014 5:03:30 PM UTC-5, Mark Walkom wrote:
>
> Can you elaborate on your dataset and structure; how many indexes,
> how many shards, how big they are etc.
>
> On 24 December 2014 at 07:36, Chris Moore <cmo...@perceivant.com>
> wrote:
>
> Updating again:
>
> If we reduce the number of shards per node to below ~350, the
> system operates fine. Once we go above that (number_of_indices *
> number_of_shards_per_index * (1 + number_of_replicas) /
> number_of_nodes), we start running into the described issues.
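
As a rough sketch of the arithmetic above, the snippet below computes shards per data node for the three test setups described elsewhere in this thread. It assumes every shard copy counts (primaries plus replicas, hence `1 + replicas`) and that only data nodes hold shards. The function name and the `THRESHOLD` constant are illustrative; the value 350 is the "~350" figure quoted above, not an Elasticsearch setting.

```python
# Approximate threshold reported in the thread (not an Elasticsearch setting).
THRESHOLD = 350


def shards_per_node(indices, shards_per_index, replicas, data_nodes):
    """Return the number of shard copies each data node holds on average.

    Assumes each primary shard has `replicas` additional copies and that
    shards are spread evenly across `data_nodes`.
    """
    total_shards = indices * shards_per_index * (1 + replicas)
    return total_shards / data_nodes


# The three test setups described in the thread:
# (indices, shards per index, replicas, data nodes)
setups = {
    "38 indices, 10 shards, 1 replica, 2 data nodes": (38, 10, 1, 2),
    "100 indices, 5 shards, 1 replica, 2 data nodes": (100, 5, 1, 2),
    "440 indices, 1 shard, 0 replicas, 1 data node": (440, 1, 0, 1),
}

for label, args in setups.items():
    per_node = shards_per_node(*args)
    status = "above" if per_node > THRESHOLD else "at or below"
    print(f"{label}: {per_node:.0f} shards per node ({status} ~{THRESHOLD})")
```

Under these assumptions all three setups land above the ~350 mark, which is consistent with each of them showing the timeouts.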
>
> On Friday, December 12, 2014 2:11:08 PM UTC-5, Chris Moore wrote:
>
> Just a quick update, we duplicated our test environment to see if
> this issue was fixed by upgrading to 1.4.1 instead. We received the
> same errors under 1.4.1.
>
> On Friday, December 5, 2014 4:52:05 PM UTC-5, Chris Moore wrote:
>
> As a followup, I closed all the indices on the cluster. I would
> then open 1 index and optimize it down to 1 segment. I made it
> through ~60% of the indices (and probably ~45% of the data) before
> the same errors showed up in the master log and the same behavior
> resumed.
>
> On Friday, December 5, 2014 3:57:12 PM UTC-5, Chris Moore wrote:
>
> I replied once, but it seems to have disappeared, so if this gets
> double posted, I'm sorry.
>
> We disabled all monitoring when we started looking into the issues
> to ensure there was no external load on ES. Everything we are
> currently seeing is just whatever activity ES generates
> internally.
>
> My understanding regarding optimizing indices is that you shouldn't
> call it explicitly on indices that are regularly updating, rather
> you should let the background merge process handle things. As the
> majority of our indices regularly update, we don't explicitly call
> optimize on them. I can try to call it on them all and see if it
> helps.
>
> As for disk speed, we are currently running ES on SSDs. We have it
> in our roadmap to change that to RAIDed SSDs, but it hasn't been a
> priority as we have been getting acceptable performance thus far.
>
> On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:
>
> Do you have a monitor tool running?
>
> I recommend switching it off, optimizing your indices, and then
> updating your monitoring tools.
>
> It seems you have too many segments, or too slow a disk, for them to
> be reported within 15s.
>
> Jörg
>
> On 05.12.2014 at 16:10, "Chris Moore" <cmo...@perceivant.com> wrote:
>
> This is running on Amazon EC2 in a VPC on dedicated instances.
> Physical network infrastructure is likely fine. Are there specific
> network issues you think we should look into?
>
> When we are in a problem state, we can communicate between the
> nodes just fine. I can run curl requests to ES (health checks, etc)
> from the master node to the data nodes directly and they return as
> expected. So, there doesn't seem to be a socket exhaustion issue
> (additionally there are no kernel errors being reported).
>
> It feels like a queue or buffer is filling up somewhere, and once it
> has capacity again, things start working. But /_cat/thread_pool?v
> doesn't show anything above 0 (although, when we are in the problem
> state, it doesn't return a response if run on master),
> nodes/hot_threads doesn't show anything going on, etc.
>
> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey
> wrote:
>
> I would think the network is a prime suspect then, as there is no
> significant difference between 1.2.x and 1.3.x in relation to
> memory usage. And you'd certainly see OOMs in node logs if it was a
> memory issue.
>
> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore
> wrote:
>
> There is nothing (literally) in the log of either data node after
> the node joined events and nothing in the master log between index
> recovery and the first error message.
>
> There are 0 queries run before the errors start occurring (access
> to the nodes is blocked via a firewall, so the only communications
> are between the nodes). We have 50% of the RAM allocated to the
> heap on each node (4GB each).
>
> This cluster operated without issue under 1.1.2. Did something
> change between 1.1.2 and 1.3.5 that drastically increased idle heap
> requirements?
>
>
> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey
> wrote:
>
> Generally ReceiveTimeoutTransportException is due to
> network disconnects or a node failing to respond due to heavy load.
> What does the log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it
> has too little heap allocated. Rule of thumb is 1/2 available
> memory but <= 31GB
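
The rule of thumb above can be written out as a small calculation. This is only a sketch of that guideline; the function name and the `cap_gb` parameter are illustrative and not part of any Elasticsearch API.

```python
def recommended_heap_gb(ram_gb, cap_gb=31.0):
    """Apply the rule of thumb: half of available memory, capped at 31 GB."""
    return min(ram_gb / 2.0, cap_gb)


# The machines in this thread have 8 GB of RAM.
print(recommended_heap_gb(8))
# A much larger machine hits the cap instead of using half its RAM.
print(recommended_heap_gb(128))
```

For the 8 GB machines described in this thread, the guideline works out to a 4 GB heap.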
>
> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller
> wrote:
>
>
> ES Version: 1.3.5
>
> OS: Ubuntu 14.04.1 LTS
>
> Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM at
> AWS
>
> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>
> *After upgrading from ES 1.1.2...*
>
>
> 1. Startup ES on master
> 2. All nodes join cluster
> 3. [2014-12-03 20:30:54,789][INFO ][gateway ] [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
> 4. Checked health a few times
>
> curl -XGET localhost:9200/_cat/health?v
>
> 5. 6 minutes after cluster recovery initiates (and 5:20 after the
> recovery finishes), the log on the master node (10.0.1.18)
> reports:
>
> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
> org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
>  at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
>
>
> 6. Every 30 seconds or 60 seconds, the above error is reported for
> one or more of the data nodes
>
> 7. During this time, queries (search, index, etc.) don’t return.
> They hang until the error state temporarily resolves itself (a
> varying time around 15-20 minutes) at which point the expected
> result is returned.
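
To see which data nodes the timeouts in step 6 are reported for, and how often, the master log can be tallied with a short script. The sketch below is an assumption-laden illustration: the regular expression is written against the single exception line quoted in step 5 (with the mail formatting debris removed), and the names `TIMEOUT_RE` and `count_timeouts` are hypothetical. Other Elasticsearch versions may format the line differently.

```python
import re
from collections import Counter

# Pattern modelled on the exception line quoted in this thread.
TIMEOUT_RE = re.compile(
    r"ReceiveTimeoutTransportException: "
    r"\[(?P<node>[^\]]+)\]"             # node name
    r"\[inet\[/(?P<addr>[^\]]+)\]\]"    # transport address
    r"\[(?P<action>[^\]]+)\] "          # transport action
    r"request_id \[(?P<request_id>\d+)\] "
    r"timed out after \[(?P<ms>\d+)ms\]"
)


def count_timeouts(log_lines):
    """Count ReceiveTimeoutTransportException occurrences per node name."""
    counts = Counter()
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("node")] += 1
    return counts


# Sample input: the exception line from the thread plus an unrelated line.
sample_log = [
    "org.elasticsearch.transport.ReceiveTimeoutTransportException: "
    "[ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]]"
    "[cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]",
    "[2014-12-03 20:30:54,789][INFO ][gateway ] "
    "[ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state",
]

for node, hits in count_timeouts(sample_log).items():
    print(f"{node}: {hits} timeout(s)")
```

In practice `log_lines` would be an open file handle on the master's log rather than an in-memory list.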
>
> -- You received this message because you are subscribed to the
> Google Groups "elasticsearch" group. To unsubscribe from this group
> and stop receiving emails from it, send an email to
> elasticsearc...@googlegroups.com. To view this discussion on
> the web visit
> https://groups.google.com/d/msgid/elasticsearch/99a45801-2b95-4a21-a6bf-ca724f41bbc2%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Chris_Moore · December 29, 2014, 3:26pm

Master and data are separate nodes. The problem node (master) never leaves
the cluster (there are no messages in the logs of the other nodes and
/_cat/health reports it is still there). It will respond to requests that
don't require checking with other nodes for any data (so _cat/health is
fine but /_search is not). Detaching jstack does not fix that behavior.
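
The behaviour described above (node-local endpoints answer while requests that fan out to other nodes hang) can be checked with a small probe that applies a timeout to each request. This is a minimal sketch using only the Python standard library; the `probe` function and its `opener` parameter are hypothetical helpers, and the base URL and timeout in the usage comment are assumptions to adjust for your cluster.

```python
import urllib.request


def probe(base_url, paths, timeout=5.0, opener=urllib.request.urlopen):
    """Return {path: True/False} depending on whether each path answers
    with HTTP 200 within `timeout` seconds.

    `opener` defaults to urllib.request.urlopen and can be swapped out
    for testing without a live cluster.
    """
    results = {}
    for path in paths:
        try:
            with opener(base_url + path, timeout=timeout) as response:
                results[path] = response.status == 200
        except OSError:
            # URLError, HTTPError and socket timeouts all derive from OSError.
            results[path] = False
    return results


# Example usage against a node on the default HTTP port (assumed):
#   probe("http://localhost:9200", ["/_cat/health", "/_search"])
# A result of {"/_cat/health": True, "/_search": False} would match the
# symptom described in this post.
```

Running the probe from the master and from each data node in turn may help narrow down which side of the transport connection is stalling.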

