ES 1.4.2 random node disconnect

Hey,

I have been having trouble for a while. I am getting random node disconnects
and I cannot explain why.
There is no increase in traffic (search or index) when this happens;
it seems completely random to me.
I first thought it could be the AWS cloud plugin, so I removed it, switched to
unicast and pointed directly at my nodes' IPs, but that didn't seem to be the
problem.
I changed the instance type (now m3.2xlarge), added more instances and made
many modifications to the ES yml config, and still nothing.
I changed Oracle Java from 1.7 to 1.8 and switched from the CMS collector to
G1GC, and still nothing.

I am out of ideas... how can I get more info on what is going on?

Here are the logs from the master node and the data node:
http://pastebin.com/GhKfRkaa

Current config:

6 m3.2xlarge instances: 1 master, 5 data nodes.
414 indices, one index per day
7372 shards; 9 shards and 1 replica per index
208 million documents, 430 GB
15 GB heap allocated per node
ES 1.4.2

Current yml config here:
http://pastebin.com/Nmdr7F6J

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/85cc2abe-da8e-4170-8e7d-a4e01f4a22c3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

If you see

cluster:monitor/nodes/stats[n]] request_id [82300775] timed out after [15000ms]

in the logs, you have a monitoring tool running that cannot complete its
requests because it takes longer than 15 seconds to traverse all the data
folders on all the nodes.
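
To see whether the stats call itself is slow, one option is to time a nodes stats request by hand. The following is a minimal Python sketch, not something taken from this thread: it assumes a node answers HTTP on localhost:9200, and `http_get` and `time_request` are hypothetical helper names.

```python
import time
import urllib.request

# Assumed address of any node's HTTP port; adjust for your cluster.
NODES_STATS_URL = "http://localhost:9200/_nodes/stats"


def http_get(url, timeout_s):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read()


def time_request(fetch, url, timeout_s=60.0):
    """Return how many seconds fetch(url, timeout_s) takes to complete."""
    start = time.monotonic()
    fetch(url, timeout_s)
    return time.monotonic() - start


# Usage against a live cluster (left commented out so nothing runs on import):
# elapsed = time_request(http_get, NODES_STATS_URL)
# print("nodes stats took %.1f s (timeout seen in the log: 15 s)" % elapsed)
```

If the measured time regularly approaches or exceeds 15 seconds, that would be consistent with the timeout in the log line above. A fast response does not rule it out, since the slow moments may be intermittent.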

There are a number of methods to reduce disk traversal time in the data
folders:

  • switch off monitoring (not really helpful) or reduce the monitoring
    frequency (maybe helpful, maybe not)

  • increase the stats request timeout (if the monitoring tool allows it; this
    does not address the cause of the problem)

  • monitor only a subset of the indices in your cluster (monitoring tools
    usually do not have this option)

  • reduce the number of segments per node, either by optimizing indices or by
    adding nodes

  • wait for a fix in a future ES release

Have you counted the total number of segments? If the number is high, did
you run _optimize with max_num_segments on your indices to reduce the
number of segments?
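
One way to count segments is the indices segments API (GET /_segments). The Python sketch below totals search segments per node from a parsed response and builds the _optimize URL with max_num_segments. The response shape used here (indices, then shards, then shard copies carrying routing.node and num_search_segments) is written from memory of the 1.x API and should be checked against your own cluster. The index name, node ids and counts in the sample are made up.

```python
from collections import Counter


def segments_per_node(segments_response):
    """Total search segments per node id in a parsed GET /_segments response."""
    counts = Counter()
    for index in segments_response.get("indices", {}).values():
        for copies in index.get("shards", {}).values():
            for copy in copies:
                node = copy.get("routing", {}).get("node", "unknown")
                counts[node] += copy.get("num_search_segments", 0)
    return dict(counts)


def optimize_url(base_url, index, max_num_segments):
    """Build the URL for POST /{index}/_optimize?max_num_segments=N."""
    return "%s/%s/_optimize?max_num_segments=%d" % (
        base_url.rstrip("/"), index, max_num_segments)


# Hand-written sample in the assumed response shape (not real cluster output).
sample = {
    "indices": {
        "logs-2015.01.08": {
            "shards": {
                "0": [
                    {"routing": {"node": "node-a"}, "num_search_segments": 12},
                    {"routing": {"node": "node-b"}, "num_search_segments": 3},
                ],
                "1": [
                    {"routing": {"node": "node-a"}, "num_search_segments": 5},
                ],
            }
        }
    }
}

print(segments_per_node(sample))
print(optimize_url("http://localhost:9200", "logs-2015.01.08", 1))
```

On a real cluster, `sample` would be replaced by the JSON body returned from `/_segments`, and the URL would be sent as a POST for each index that is no longer being written to.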

Jörg


Hey, thank you for answering. I am using the latest version of Marvel.

Here is more info about the problem:

https://github.com/elasticsearch/elasticsearch/issues/9212#issuecomment-69292232

The thing is, I don't think it is the monitoring plugin. When this happens, my
node gets disconnected and the cluster goes into yellow state until it
recovers. I am using curator optimize; it is set to 2 segments for
indices older than 2 days.