Another node tries to become master (possibly due to GC hangs)

Hi,

I have a cluster of two nodes (index11 and index12). A new index is
created every day with 4 shards and zero replicas. From the logs and
graphs, I can tell that around the time indexing stopped in ES,

index11 says (3 log lines): http://sprunge.us/GXHe
and
index12 says (3 log lines): http://sprunge.us/OWbd

When I checked the status 50 minutes after this happened, I saw in
elasticsearch-head that index11 reports everything as green, but on
index12 the cluster health is red.

Health of index11:
{"cluster_name":"logstash","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":2,"active_primary_shards":20,"active_shards":20,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0}

Health of index12:
{"cluster_name":"logstash","status":"red","timed_out":false,"number_of_nodes":5,"number_of_data_nodes":1,"active_primary_shards":10,"active_shards":10,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":10}
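The two responses above can be compared field by field. The following is a minimal sketch in Python (not part of the original post); the helper name `diff_health` is made up for illustration, and the JSON strings are copied from the two health outputs above.

```python
import json

# Health responses copied verbatim from the two nodes above.
HEALTH_INDEX11 = '{"cluster_name":"logstash","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":2,"active_primary_shards":20,"active_shards":20,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0}'
HEALTH_INDEX12 = '{"cluster_name":"logstash","status":"red","timed_out":false,"number_of_nodes":5,"number_of_data_nodes":1,"active_primary_shards":10,"active_shards":10,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":10}'


def diff_health(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    fields = sorted(set(a) | set(b))
    return {f: (a.get(f), b.get(f)) for f in fields if a.get(f) != b.get(f)}


# Print only the fields on which the two nodes disagree.
diffs = diff_health(json.loads(HEALTH_INDEX11), json.loads(HEALTH_INDEX12))
for field, (on_index11, on_index12) in diffs.items():
    print(f"{field}: index11={on_index11} index12={on_index12}")
```

The nodes disagree on the node counts and shard counts, not just on the status colour, which would be consistent with each node holding a different view of the cluster.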

Has something gone wrong during GC pauses? Is there a way I can
avoid this? This is pretty serious and needs to be fixed. Please ask if
there is anything I can help with to debug this.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Try using the cluster state API to help debug what the current state is.
Since there might be two clusters, you would need to check the status on
every node:
http://:9200/_cluster/state
http://:9200/_cluster/nodes
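The per-node check described above could be scripted roughly as follows. This is a sketch under assumptions rather than a definitive tool: it assumes the hosts are reachable as index11 and index12 on port 9200, and that the `/_cluster/state` response carries a top-level `master_node` field. The function names are invented for this example.

```python
import json
from collections import defaultdict
from urllib.request import urlopen

# Assumption: the two data nodes from this thread, HTTP on port 9200.
NODES = ["index11", "index12"]


def fetch_cluster_state(host: str, port: int = 9200, timeout: float = 5.0) -> dict:
    """GET /_cluster/state from a single node and decode the JSON body."""
    with urlopen(f"http://{host}:{port}/_cluster/state", timeout=timeout) as resp:
        return json.load(resp)


def group_by_master(states: dict) -> dict:
    """Map each reported master node ID to the hosts that reported it."""
    groups = defaultdict(list)
    for host, state in states.items():
        # Assumption: the state document exposes the elected master here.
        groups[state.get("master_node")].append(host)
    return dict(groups)


def is_split(states: dict) -> bool:
    """True when the queried nodes do not all agree on a single master."""
    return len(group_by_master(states)) > 1
```

On a live cluster, `is_split({h: fetch_cluster_state(h) for h in NODES})` returning `True` would mean the nodes report different masters, which is the "two clusters" situation mentioned above.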

How many nodes are in your cluster, and is minimum_master_nodes set correctly?
Are your ping timeouts at the default values? GC will always happen, but
the pauses would need to be long to cause timeouts.

--
Ivan


On 03/05/2013 10:18 PM, Ivan Brusic wrote:

> Try using the cluster state API to help debug what the current state
> is. Since there might be two clusters, you would need to check the
> status on every node:
> http://:9200/_cluster/state
> http://:9200/_cluster/nodes

I can't really get the stats now, as I have already restarted to fix the
problem. Can you tell me precisely what you were expecting to find, so
that I can have a look at it the next time this happens?

> How many nodes in your cluster and is minimum_master_nodes set
> correctly? Are your ping timeouts using the default values? GC will
> always happen, but they would need to be large to cause timeouts.

The discovery.zen.minimum_master_nodes option is commented out in my
config, and I have two nodes. In normal operation I never restart the
nodes. Should I set discovery.zen.ping.timeout to something really long?
(Right now it's 10 seconds.)
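For reference, the usual rule of thumb for discovery.zen.minimum_master_nodes is a majority of the master-eligible nodes. The sketch below only illustrates that arithmetic; the function name is made up, and how many of the nodes in this cluster are actually master-eligible is an assumption, since the thread does not say.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Assumption: only the two data nodes (index11, index12) are master-eligible.
for n in (2, 3, 5):
    print(f"{n} master-eligible nodes -> minimum_master_nodes = {minimum_master_nodes(n)}")
```

With two master-eligible nodes the quorum is 2, so neither node could elect itself master alone; the trade-off is that the cluster has no master while either node is unreachable.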

I have 48GB of RAM on these nodes, and they generate a new index
every day with 400-500GB of data. I currently start ES with the options
"-Xms4g -Xmx10g -Xss256k". Is there something I should change here?


"-Xms4g -Xmx10g"

This may not be your problem, but I would personally set the min and max
heap to the same value, because otherwise you will get fragmentation of
the process's virtual memory over a long-running process. With both set to
the same value, the Java process can allocate a single contiguous memory
segment for the heap and can then properly reshuffle objects within it for
efficiency. Without a contiguous segment, you will get spotty segments
where the GC has to work harder to find places for things to fit.

Why would you want the JVM to trim itself down? If you're prepared to
allocate a 10GB heap, why not just let it have it forever? It's much
better in the long run.

regards,

Paul
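The min/max heap suggestion above can be checked mechanically against a startup option string. This is only an illustrative sketch: the helper names are invented, and the parser handles nothing beyond the -Xms/-Xmx forms quoted in this thread.

```python
import re

# Multipliers for the size suffixes accepted by -Xms/-Xmx.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_heap_flags(opts: str) -> dict:
    """Extract the -Xms and -Xmx sizes (in bytes) from a JVM option string."""
    sizes = {}
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG]?)", opts):
        sizes[flag] = int(number) * _UNITS.get(unit.lower(), 1)
    return sizes


def heap_is_fixed(opts: str) -> bool:
    """True when both flags are present and request the same heap size."""
    sizes = parse_heap_flags(opts)
    return "Xms" in sizes and sizes.get("Xms") == sizes.get("Xmx")


# The options quoted in this thread: min and max heap differ.
print(heap_is_fixed("-Xms4g -Xmx10g -Xss256k"))
# The same options with min raised to match max, as suggested above.
print(heap_is_fixed("-Xms10g -Xmx10g -Xss256k"))
```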
