Startup issues with ES 1.3.5

ES Version: 1.3.5

OS: Ubuntu 14.04.1 LTS

Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM at AWS

master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)

After upgrading from ES 1.1.2...

  1. Start up ES on the master
  2. All nodes join the cluster
  3. [2014-12-03 20:30:54,789][INFO ][gateway ]
    [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
  4. Checked health a few times:

curl -XGET localhost:9200/_cat/health?v

  5. 6 minutes after cluster recovery initiates (and 5:20 after the recovery
    finishes), the log on the master node (10.0.1.18) reports:

[2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

  6. Every 30 or 60 seconds, the above error is reported for one or more of
    the data nodes.

  7. During this time, queries (search, index, etc.) don’t return. They hang
    until the error state temporarily resolves itself (after a varying period
    of roughly 15-20 minutes), at which point the expected results are
    returned.


Generally, ReceiveTimeoutTransportException is due to network disconnects or
to a node failing to respond because it is under heavy load. What does the
log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little heap
allocated. The rule of thumb is half the available memory, but no more than 31GB.
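
For reference, heap size on ES 1.x is normally set through the ES_HEAP_SIZE
environment variable; a minimal sketch, assuming the Ubuntu .deb package layout
(the path and the 4g value here are just examples for an 8 GB machine):

# /etc/default/elasticsearch (Ubuntu .deb install; adjust path for other installs)
ES_HEAP_SIZE=4g

# then restart the node so the new heap takes effect
sudo service elasticsearch restart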

On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:


There is literally nothing in the log of either data node after the
node-joined events, and nothing in the master log between index recovery and
the first error message.

No queries are run before the errors start occurring (access to the nodes is
blocked by a firewall, so the only communication is between the nodes). We
have 50% of the RAM allocated to the heap on each node (4 GB each).

This cluster operated without issue under 1.1.2. Did something change
between 1.1.2 and 1.3.5 that drastically increased idle heap requirements?
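
For what it's worth, actual heap usage per node can be checked directly while
the cluster is idle; a minimal sketch using the node stats and cat APIs (the
column names are taken from the 1.x cat-nodes docs; adjust if your version
lists them differently):

curl -s 'localhost:9200/_nodes/stats/jvm?pretty'                    # heap used/committed per node
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'  # quick per-node heap summary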

On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:


I would think the network is the prime suspect then, as there is no
significant difference between 1.2.x and 1.3.x in terms of memory usage.
And you'd certainly see OOMs in the node logs if it were a memory issue.
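
A quick way to rule that out is to grep the node logs for OutOfMemoryError; a
minimal sketch, assuming the default .deb log location (adjust the path and
cluster name for your install):

grep -i "OutOfMemoryError" /var/log/elasticsearch/*.log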

On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:


This is running on Amazon EC2 in a VPC on dedicated instances, so the physical
network infrastructure is likely fine. Are there specific network issues you
think we should look into?

When we are in the problem state, we can communicate between the nodes just
fine. I can run curl requests against ES (health checks, etc.) from the master
node directly to the data nodes and they return as expected. So there doesn't
seem to be a socket-exhaustion issue (and no kernel errors are being reported).

It feels like a queue or buffer is filling up somewhere, and once it has
capacity again things start working. But /_cat/thread_pool?v doesn't show
anything above 0 (although, when we are in the problem state, it doesn't
return a response when run on the master), nodes/hot_threads doesn't show
anything going on, etc.
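
Since the timeouts are on the master's nodes-stats calls, one thing that may
help is hitting the same APIs on a data node directly over HTTP while the
cluster is in the bad state; a minimal sketch (10.0.1.20 and port 9200 are
taken from the setup above, the endpoints are the stock 1.x diagnostics):

curl -s 'http://10.0.1.20:9200/_cat/thread_pool?v'            # thread pool queues and rejections
curl -s 'http://10.0.1.20:9200/_nodes/_local/hot_threads'     # what that node's busy threads are doing
curl -s 'http://10.0.1.20:9200/_nodes/_local/stats?pretty'    # the same stats the master is timing out on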

On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:


I see. When you say the data nodes have literally nothing in their logs, do
you mean they aren't logging anything at all, or just nothing interesting?

On Friday, December 5, 2014 7:10:13 AM UTC-8, Chris Moore wrote:


Do you have a monitoring tool running?

I recommend switching it off, optimizing your indices, and then updating
your monitoring tools.

It seems you have so many segments, or a slow enough disk, that they can't be
reported within 15s.
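
For context, segment counts can be inspected and an explicit merge triggered
with the stock 1.x APIs; a minimal sketch, with my_index as a placeholder
index name:

curl -s 'localhost:9200/my_index/_segments?pretty'                    # per-shard segment counts
curl -XPOST 'localhost:9200/my_index/_optimize?max_num_segments=1'    # force-merge down to one segment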

Jörg
On 05.12.2014 16:10, "Chris Moore" cmoore@perceivant.com wrote:


I mean they aren't logging anything (until I send the shutdown command, a
node leaves, etc.). It's not that I think there's an issue with the logging;
the data nodes simply have nothing to log because everything looks fine to
them. I have attached a log from one of the data nodes showing this, with
notes marking when the master node first reported an error and when I issued
SIGTERM to all of the ES instances.

On Friday, December 5, 2014 2:39:27 PM UTC-5, Support Monkey wrote:


We disabled all monitoring before requesting help, to ensure there was no
load on ES beyond what it generates internally.

My understanding of optimize was that it shouldn't be run on indices that are
regularly updated, and that the background merge process should be left to
handle them. The majority of our indices receive regular updates, so we don't
explicitly optimize them. I can call optimize on all of them and see if it
helps.

As for disk speed, we're using SSDs on all nodes. We plan to switch to RAIDed
SSDs, but haven't had the need yet.

On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:


I replied once, but it seems to have disappeared, so I'm sorry if this gets
double-posted.

We disabled all monitoring when we started looking into the issues, to ensure
there was no external load on ES. Everything we are currently seeing is
whatever activity ES generates internally.

My understanding regarding optimizing indices is that you shouldn't call
optimize explicitly on indices that are regularly updated; rather, you should
let the background merge process handle things. As the majority of our
indices update regularly, we don't explicitly call optimize on them. I can
try calling it on all of them and see if it helps.

As for disk speed, we are currently running ES on SSDs. Changing to RAIDed
SSDs is on our roadmap, but it hasn't been a priority as performance has been
acceptable so far.

On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:


As a follow-up, I closed all the indices on the cluster, then opened one
index at a time and optimized it down to 1 segment. I made it through ~60% of
the indices (and probably ~45% of the data) before the same errors showed up
in the master log and the same behavior resumed.
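
For anyone following along, the per-index cycle looks roughly like this; a
minimal sketch against the standard 1.x endpoints, with my_index as a
placeholder name:

curl -XPOST 'localhost:9200/_all/_close'                              # close everything first
curl -XPOST 'localhost:9200/my_index/_open'                           # open one index
curl -XPOST 'localhost:9200/my_index/_optimize?max_num_segments=1'    # force-merge it to a single segment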

On Friday, December 5, 2014 3:57:12 PM UTC-5, Chris Moore wrote:


Just a quick update: we duplicated our test environment to see whether this
issue was fixed by upgrading to 1.4.1 instead. We received the same errors
under 1.4.1.

On Friday, December 5, 2014 4:52:05 PM UTC-5, Chris Moore wrote:


Updating again:

If we reduce the number of shards per node to below ~350, the system
operates fine. Once we go above that (number_of_indices *
number_of_shards_per_index * number_of_replicas / number_of_nodes), we
start running into the described issues.
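
For reference, the actual shard count per node (rather than the computed
estimate) can be read straight from the cat APIs; a minimal sketch:

curl -s 'localhost:9200/_cat/allocation?v'      # shards and disk usage per node
curl -s 'localhost:9200/_cat/shards' | wc -l    # total shard copies in the cluster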

On Friday, December 12, 2014 2:11:08 PM UTC-5, Chris Moore wrote:


Can you elaborate on your dataset and structure: how many indexes, how many
shards, how big they are, etc.?

On 24 December 2014 at 07:36, Chris Moore cmoore@perceivant.com wrote:

Updating again:

If we reduce the number of shards per node to below ~350, the system
operates fine. Once we go above that (number_of_indices *
number_of_shards_per_index * (number_of_replicas + 1) / number_of_nodes), we
start running into the described issues.
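
A quick way to see how many shard copies each node is actually carrying is the
_cat allocation endpoint (the same _cat family as the health check earlier in
the thread); its shards column should roughly match the number_of_indices *
number_of_shards_per_index * (number_of_replicas + 1) / number_of_nodes
estimate above:

curl -XGET 'localhost:9200/_cat/allocation?v'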


We tried many different test setups yesterday. The first setup we tried was:

1 Master, 2 Data nodes
38 indices
10 shards per index
1 replica per index
760 total shards (380 primary, 380 replica)
Each index had 2,745 documents
Each index was 218.9kb in size (according to the _cat/indices API)

We realize that 10 shards per index with only 2 nodes is not a good idea,
so we changed that and reran the tests.

We changed shards per index to the default of 5 and put 100 indices on the
2 boxes and ran into the same issue. It was the same dataset, so all other
size information is correct.

After that, we turned off one of the data nodes, set replicas to 0 and
shards per index to 1. With the same dataset, I loaded ~440 indices and ran
into the timeout issues with the Master and Data nodes just idling.

This is just a test dataset that we came up with to quickly reproduce our
issues; it contains no confidential information. Once we figure out the issues
affecting this test dataset, we'll try things with our real dataset.

All of this works fine on ES 1.1.2, but not on 1.3.x (1.3.5 is our current
test version). We have also tried our real setup on 1.4.1 to no avail.
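
Working the per-node numbers for the three setups above (counting every shard
copy, primary or replica, against the node that hosts it, and assuming the
replica count stayed at 1 for the second run):

38 indices x 10 shards x (1 primary + 1 replica) / 2 data nodes = 380 shards per node
100 indices x 5 shards x (1 primary + 1 replica) / 2 data nodes = 500 shards per node
440 indices x 1 shard x (1 primary + 0 replicas) / 1 data node = 440 shards per node

All three sit above the ~350-shards-per-node point where the problems were
reported to start.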


Attached are the script we've been using to load the data and the dataset
itself. Here are the mapping and a sample document:

{
  "baseball_1" : {
    "mappings" : {
      "team" : {
        "properties" : {
          "L" : {
            "type" : "integer",
            "store" : true
          },
          "W" : {
            "type" : "integer",
            "store" : true
          },
          "name" : {
            "type" : "string",
            "store" : true
          },
          "teamID" : {
            "type" : "string",
            "store" : true
          },
          "yearID" : {
            "type" : "string",
            "store" : true
          }
        }
      }
    }
  }
}

{"yearID":"1871", "teamID":"PH1", "W":"21", "L":"7", "name":"Philadelphia
Athletics"}


Ok, a few things that don't make sense to me:

  1. 10 indexes of only ~220KB? Are you sure of this?
  2. If so, why not just one index?
  3. Is baseball_data.json the data for an entire index? If not, can you
    clarify?
  4. What Java version are you on?
  5. What monitoring were you using?
  6. Can you delete all your data, switch monitoring on, start reindexing,
    and then watch what happens? Marvel would be ideal for this.

What you are seeing is really, really weird. That is a high shard count;
however, given the dataset is small, I wouldn't think it'd cause problems
(but I could be wrong).
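
For reference, a few quick checks on one of the nodes cover the size and Java
version questions (this assumes ES is running on the default java on the PATH
and is reachable on the same localhost:9200 endpoint used elsewhere in this
thread):

java -version                                   # JVM build the node is running on
curl -XGET 'localhost:9200/_cat/indices?v'      # per-index doc count and store size
curl -s 'localhost:9200/_cat/shards' | wc -l    # total shard copies in the cluster
curl -s 'localhost:9200/_cat/segments' | wc -l  # total segments across all shards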


We might have faced a similar problem with ES 1.3.6. The cause we found was
likely concurrent merges. These settings have helped us fix the issue:

merge:
  policy:
    max_merge_at_once: 5
    reclaim_deletes_weight: 4.0
    segments_per_tier: 5
indices:
  store:
    throttle:
      max_bytes_per_sec: 40mb # as we only have a few SATA disks for storage
      type: merge

You can check the hung process by attaching jstack to it (with the pid of the
ES java process):

jstack -F <pid>

Also, once you detach jstack, the process becomes responsive again and rejoins
the cluster. That should not happen at all: even if the disk is the
limitation, ES should not stop responding.

- Gurvinder
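
As a side note, on a 1.x cluster the store throttle is a dynamic cluster
setting, and the merge policy values can be pushed to existing indices via the
settings API, so something along these lines may be worth trying before
editing elasticsearch.yml (treat it as a sketch; whether every merge.policy
value can be changed on a live index depends on the exact version):

# store throttling: dynamic cluster-wide settings
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent" : {
    "indices.store.throttle.type" : "merge",
    "indices.store.throttle.max_bytes_per_sec" : "40mb"
  }
}'

# merge policy: per-index settings, applied here to all indices
curl -XPUT 'localhost:9200/_all/_settings' -d '{
  "index.merge.policy.max_merge_at_once" : 5,
  "index.merge.policy.segments_per_tier" : 5,
  "index.merge.policy.reclaim_deletes_weight" : 4.0
}'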

Thanks for responding.

The baseball dataset was created simply to have a small dataset that could
reproduce the issue we were seeing with our production data. So yes, each
copy of the index is only 220kb in size.

It's not all just one index because it was created for testing purposes. We
currently have 2 production clusters; the first has ~95 indices and the second
has ~304, each with unrelated data and sizes ranging from just a few KB to
hundreds of GB.

Yes, the baseball_data.json file is used to fully populate 1 index. We load
that file into each index we create (so baseball_1, baseball_2, baseball_3)
in the shell script so we can quickly analyze a number_of_indices,
number_of_shards_per_index, number_of_replicas combination to try to figure
out more about these issues.

As for Java version:
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)

We hooked up Marvel to see if it would help us analyze the issue. It didn't
show anything interesting (no full thread pools, no sign of resource
starvation, etc.). I can provide specific metrics if that would help.

Thanks.

On Wednesday, December 24, 2014 2:00:42 PM UTC-5, Mark Walkom wrote:

Ok a few things that don't make sense to me;

  1. 10 indexes of only ~220Kb? Are you sure of this?
  2. If so why not just one index?
  3. Is baseball_data.json the data for an entire index? If not can you
    clarify.
  4. What java version are you on?
  5. What monitoring were you using?
  6. Can you delete all your data, switch monitoring on, start
    reindexing and then watch what happens? Marvel would be ideal for this.

What you are seeing is really, really weird. That is a high shard count
however given the dataset is small I wouldn't think it'd cause problems
(but I could be wrong).

On 25 December 2014 at 02:27, Chris Moore <cmo...@perceivant.com
<javascript:>> wrote:

Attached is the script we've been using to load the data and the dataset.
This is the mapping and a sample document

{
"baseball_1" : {
"mappings" : {
"team" : {
"properties" : {
"L" : {
"type" : "integer",
"store" : true
},
"W" : {
"type" : "integer",
"store" : true
},
"name" : {
"type" : "string",
"store" : true
},
"teamID" : {
"type" : "string",
"store" : true
},
"yearID" : {
"type" : "string",
"store" : true
}
}
}
}
}
}

{"yearID":"1871", "teamID":"PH1", "W":"21", "L":"7", "name":"Philadelphia
Athletics"}

On Wednesday, December 24, 2014 10:22:00 AM UTC-5, Chris Moore wrote:

We tried many different test setups yesterday. The first setup we tried
was:

1 Master, 2 Data nodes
38 indices
10 shards per index
1 replica per index
760 total shards (380 primary, 760 total)
Each index had 2,745 documents
Each index was 218.9kb in size (according to the _cat/indices API)

We realize that 10 shards per index with only 2 nodes is not a good
idea, so we changed that and reran the tests.

We changed shards per index to the default of 5 and put 100 indices on
the 2 boxes and ran into the same issue. It was the same dataset, so all
other size information is correct.

After that, we turned off one of the data nodes, set replicas to 0 and
shards per index to 1. With the same dataset, I loaded ~440 indices and ran
into the timeout issues with the Master and Data nodes just idling.

This is just a test dataset that we came up with to quickly test our
issues that contains no confidential information. Once we figure out the
issues affecting this test dataset, we'll try things with our real dataset.

All of this works fine on ES 1.1.2, but not on 1.3.x (1.3.5 is our
current test version). We have also tried our real setup on 1.4.1 to no
avail.

On Tuesday, December 23, 2014 5:03:30 PM UTC-5, Mark Walkom wrote:

Can you elaborate on your dataset and structure; how many indexes, how
many shards, how big they are etc.

On 24 December 2014 at 07:36, Chris Moore cmo...@perceivant.com
wrote:

Updating again:

If we reduce the number of shards per node to below ~350, the system
operates fine. Once we go above that (number_of_indices *
number_of_shards_per_index * number_of_replicas / number_of_nodes), we
start running into the described issues.

On Friday, December 12, 2014 2:11:08 PM UTC-5, Chris Moore wrote:

Just a quick update, we duplicated our test environment to see if
this issue was fixed by upgrading to 1.4.1 instead. We received the same
errors under 1.4.1.

On Friday, December 5, 2014 4:52:05 PM UTC-5, Chris Moore wrote:

As a followup, I closed all the indices on the cluster. I would then
open 1 index and optimize it down to 1 segment. I made it through ~60% of
the indices (and probably ~45% of the data) before the same errors showed
up in the master log and the same behavior resumed.

On Friday, December 5, 2014 3:57:12 PM UTC-5, Chris Moore wrote:

I replied once, but it seems to have disappeared, so if this gets
double posted, I'm sorry.

We disabled all monitoring when we started looking into the issues
to ensure there was no external load on ES. Everything we are currently
seeing is just whatever activity ES generates internally.

My understanding regarding optimizing indices is that you shouldn't
call it explicitly on indices that are regularly updating, rather you
should let the background merge process handle things. As the majority of
our indices regularly update, we don't explicitly call optimize on them. I
can try to call it on them all and see if it helps.

As for disk speed, we are currently running ES on SSDs. We have it
in our roadmap to change that to RAIDed SSDs, but it hasn't been a priority
as we have been getting acceptable performance thus far.

On Friday, December 5, 2014 2:59:11 PM UTC-5, Jörg Prante wrote:

Do you have a monitoring tool running?

I recommend switching it off, optimizing your indices, and then updating your
monitoring tools.

It seems you have too many segments, or too slow a disk, for them to be
reported within 15s.

Jörg
On 05.12.2014 16:10, "Chris Moore" cmo...@perceivant.com wrote:

This is running on Amazon EC2 in a VPC on dedicated instances.
Physical network infrastructure is likely fine. Are there specific network
issues you think we should look into?

When we are in a problem state, we can communicate between the
nodes just fine. I can run curl requests to ES (health checks, etc) from
the master node to the data nodes directly and they return as expected. So,
there doesn't seem to be a socket exhaustion issue (additionally there are
no kernel errors being reported).

It feels like a queue/buffer is filling up somewhere, and once it has
availability again, things start working. But /_cat/thread_pool?v doesn't show
anything above 0 (although, when we are in the problem state, it doesn't return
a response if run on the master), /_nodes/hot_threads doesn't show anything
going on, etc.

On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey
wrote:

I would think the network is a prime suspect then, as there is
no significant difference between 1.2.x and 1.3.x in relation to memory
usage. And you'd certainly see OOMs in node logs if it was a memory issue.

On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore
wrote:

There is nothing (literally) in the log of either data node
after the node joined events and nothing in the master log between index
recovery and the first error message.

There are 0 queries run before the errors start occurring
(access to the nodes is blocked via a firewall, so the only communications
are between the nodes). We have 50% of the RAM allocated to the heap on
each node (4GB each).
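
(On the Ubuntu package that heap size would normally be set via ES_HEAP_SIZE;
assuming the standard .deb layout, that means something like the line below in
/etc/default/elasticsearch:)

ES_HEAP_SIZE=4g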

This cluster operated without issue under 1.1.2. Did something
change between 1.1.2 and 1.3.5 that drastically increased idle heap
requirements?

On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey
wrote:

Generally ReceiveTimeoutTransportException is due to network
disconnects or a node failing to respond due to heavy load. What does the
log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little heap
allocated. Rule of thumb is 1/2 available memory but <= 31GB

On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller
wrote:

ES Version: 1.3.5

OS: Ubuntu 14.04.1 LTS

Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB
RAM at AWS

master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19,
ip-10-0-1-20)

After upgrading from ES 1.1.2...

  1. Startup ES on master
  2. All nodes join cluster
  3. [2014-12-03 20:30:54,789][INFO ][gateway ] [ip-10-0-1-18.ec2.internal]
    recovered [157] indices into cluster_state
  4. Checked health a few times

curl -XGET localhost:9200/_cat/health?v

  5. 6 minutes after cluster recovery initiates (and 5:20 after
    the recovery finishes), the log on the master node (10.0.1.18) reports:

[2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats]
[ip-10-0-1-18.ec2.internal] failed to execute on node
[pYi3z5PgRh6msJX_armz_A]

org.elasticsearch.transport.ReceiveTimeoutTransportException:
[ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n]
request_id [17564] timed out after [15001ms]

at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

  6. Every 30 seconds or 60 seconds, the above error is
    reported for one or more of the data nodes

  7. During this time, queries (search, index, etc.) don’t
    return. They hang until the error state temporarily resolves itself (a
    varying time around 15-20 minutes) at which point the expected result is
    returned.


I tried your configuration suggestions, but the behavior was no different. I
have attached the jstack output from the troubled node (master); it didn't
appear to indicate anything of note.

On Thursday, December 25, 2014 8:33:20 AM UTC-5, Gurvinder Singh wrote:

We might have faced a similar problem with ES 1.3.6. The cause, we found, might
be concurrent merges. These settings have helped us fix the issue:
merge:
  policy:
    max_merge_at_once: 5
    reclaim_deletes_weight: 4.0
    segments_per_tier: 5
indices:
  store:
    throttle:
      max_bytes_per_sec: 40mb # as we have few SATA disk for storage
      type: merge
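
If it is easier than editing elasticsearch.yml and restarting, the store
throttle settings are dynamic cluster settings on 1.x, and the merge policy
settings can be pushed to existing indices through the index settings API;
something like the following should be equivalent (host is an assumption):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "40mb"
  }
}'

curl -XPUT 'localhost:9200/_all/_settings' -d '{
  "index.merge.policy.max_merge_at_once": 5,
  "index.merge.policy.segments_per_tier": 5,
  "index.merge.policy.reclaim_deletes_weight": 4.0
}'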

You can check the hung process by attaching jstack to it:

jstack -F <pid>
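
Concretely, something like this (assuming a single ES process per box; the
pgrep pattern is just one way to find the PID):

ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
jstack -F "$ES_PID" > /tmp/es-threaddump.txt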

Also, once you detach jstack, the process becomes responsive again and rejoins
the cluster. This should not happen at all, though; even if the disk is the
limitation, ES should not stop responding.

- Gurvinder

On 12/24/2014 08:00 PM, Mark Walkom wrote:

Ok, a few things that don't make sense to me:

  1. 10 indexes of only ~220Kb? Are you sure of this?
  2. If so, why not just one index?
  3. Is baseball_data.json the data for an entire index? If not, can you clarify?
  4. What Java version are you on?
  5. What monitoring were you using?
  6. Can you delete all your data, switch monitoring on, start reindexing and
     then watch what happens? Marvel would be ideal for this (see the install
     sketch below).

What you are seeing is really, really weird. That is a high shard count;
however, given the dataset is small, I wouldn't think it'd cause problems (but
I could be wrong).


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/12685f92-441f-4bac-96ea-c7dd3b0cba47%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.