Very weird ES cluster state problem!

Hi All,

I have an ES cluster with 2 nodes. I am not sure what caused this
issue:

Node 2:

[2] received shard failed for [TestDocTestDoc][2], node[J368dRSdRxOUkTOEqIOsHg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[TestDoc][2] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2013-02-01 16:37:41,667][WARN ][indices.cluster ] [2] [TestDoc][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [TestDoc][2] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:108)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
[2013-02-01 16:37:41,834][WARN ][cluster.action.shard ] [2] sending failed shard for [TestDoc][2], node[J368dRSdRxOUkTOEqIOsHg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[TestDoc][2] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2013-02-01 16:37:41,834][WARN ][cluster.action.shard ] [2] received shard failed for [TestDoc][2], node[J368dRSdRxOUkTOEqIOsHg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[TestDoc][2] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2013-02-01 16:39:00,306][WARN ][discovery.zen ] [2] master should not receive new cluster state from [[1][IlJPr1CBTmKxSgHyHJ7brg][inet[/10.190.209.134:9300]]]

Node 1:

[2013-02-01 10:08:03,861][DEBUG][action.search.type ] [1] failed to reduce search
org.elasticsearch.action.search.ReduceSearchPhaseException: Failed to execute phase [fetch], [reduce]
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:177)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchQueryThenFetchAction.java:155)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchQueryThenFetchAction.java:1)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:345)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.executeFetch(TransportSearchQueryThenFetchAction.java:149)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2.run(TransportSearchQueryThenFetchAction.java:136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassCastException: org.elasticsearch.search.facet.termsstats.longs.InternalTermsStatsLongFacet cannot be cast to org.elasticsearch.plugin.multifssearch.InternalTermsStatsStringFacetMulti
    at org.elasticsearch.plugin.multifssearch.InternalTermsStatsStringFacetMulti.reduce(InternalTermsStatsStringFacetMulti.java:490)
    at org.elasticsearch.plugin.multifssearch.TermsStatsFacetProcessorMulti.reduce(TermsStatsFacetProcessorMulti.java:166)
    at org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhaseController.java:296)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.innerFinishHim(TransportSearchQueryThenFetchAction.java:190)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:175)
    ... 8 more
[2013-02-01 12:59:42,516][INFO ][cluster.metadata ] [1] [TestDoc2] creating index, cause [auto(bulk api)], shards [5]/[0], mappings [TestDoc2~type1]
[2013-02-01 13:00:34,555][INFO ][cluster.metadata ] [1] [TestDoc3] creating index, cause [auto(bulk api)], shards [5]/[0], mappings [TestDoc3~type1]

The cluster health API response from node1:

{
  "cluster_name" : "test1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 4256,
  "active_shards" : 4256,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4209
}

The cluster health API response from node2:

{
  "cluster_name" : "test",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 8471,
  "active_shards" : 8471,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 0
}

I looked through the ES group but could not find this exact issue.
It looks like one of the nodes (the primary) left the cluster because of
a network issue (not sure what the actual problem was; I am assuming it
was the network), and the secondary got elected as master. When the
network issue was resolved, the primary node tried to rejoin the
cluster, which did happen. But perhaps the state was not synced? Or were
there two master nodes: master1 with two nodes in its cluster but unable
to communicate with the data node, and master2 with only one node in its
cluster?

Please help; this is going over my head. I have looked through the
different threads, but found nothing concrete.

Thanks in advance
Amit

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Yes, it looks like a crash or a split. The problematic indexes are lost
and should be deleted; otherwise, ES is not able to resolve the
conflict. Note that there are precautions against such node splits: did
you change the default settings in zen discovery,
minimum_master_nodes for example?
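For reference, the zen discovery precaution mentioned here is configured in elasticsearch.yml; a minimal sketch with illustrative values (the value 2 assumes a cluster of three master-eligible nodes):

```yaml
# elasticsearch.yml (illustrative): with N master-eligible nodes, the
# usual recommendation is a quorum of N/2 + 1 (integer division).
# For a 3-node cluster that is 2. A 2-node cluster cannot be protected
# this way, because a quorum of 2 means neither node can operate alone.
discovery.zen.minimum_master_nodes: 2
```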

Best regards,

Jörg

On 01.02.13 19:01, Amit Singh wrote:

Jorg mentioned the important minimum_master_nodes setting. Which version of ES are you on?

On Feb 1, 2013, at 11:18 PM, Jörg Prante joergprante@gmail.com wrote:


Thanks Jorg and kimchy,

I am using ES version 0.19.4, and the minimum_master_nodes setting is at
its default. Since I have only two nodes in the cluster, I did not
change minimum_master_nodes, as N/2+1 would give me a value of 1 for two
nodes. Furthermore, when I restarted the cluster I still got the same error!

Please suggest.

Thanks
Amit
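As an aside on the arithmetic above: the quorum formula usually quoted is floor(N/2) + 1 over the master-eligible nodes, and for N = 2 it works out to 2, not 1, which is why a two-node cluster cannot be protected against split-brain by this setting alone. A quick sketch:

```python
# Quorum formula commonly recommended for minimum_master_nodes:
# floor(N / 2) + 1, where N is the number of master-eligible nodes.
def quorum(n_master_eligible: int) -> int:
    return n_master_eligible // 2 + 1

for n in (1, 2, 3, 5):
    print(n, "->", quorum(n))
# For n = 2 the quorum is 2: both nodes must be up to elect a master,
# so a two-node cluster either risks split-brain (setting of 1) or
# loses availability whenever one node is down (setting of 2).
```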

On Sunday, February 3, 2013 3:42:02 AM UTC+5:30, kimchy wrote:


Are these the actual responses, or did you change the cluster names
before posting them to the mailing list?

The cluster health API response from node1:

{
  "cluster_name" : "test1",
  "status" : "red",

The cluster health API response from node2:

{
  "cluster_name" : "test",
  "status" : "red",
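The discrepancy being asked about can be checked mechanically; a minimal sketch over just the fields quoted above (abbreviated responses, not the full health output):

```python
import json

# Abbreviated /_cluster/health responses as quoted in the thread.
node1 = json.loads('{"cluster_name": "test1", "status": "red"}')
node2 = json.loads('{"cluster_name": "test", "status": "red"}')

# Two nodes can only join the same cluster if cluster.name matches
# exactly; "test1" vs "test" would mean these were never one cluster.
print(node1["cluster_name"] == node2["cluster_name"])  # -> False
```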

On Sunday, February 3, 2013 12:08:56 AM UTC-5, Amit Singh wrote:

Thanks Jorg and kimchy,

I am using ES 0.19.4 version and minimum_master_nodes setting is default.
Since I have only two nodes in the cluster, I did not change
the minimum_master_nodes, as N/2+1 will give me value of 1 for two nodes.
Further when I restarted the cluster I am still getting the same error!

Please suggest.

Thanks
Amit

On Sunday, February 3, 2013 3:42:02 AM UTC+5:30, kimchy wrote:

Jorg mentioned the important minimum_master_nodes, which version of ES
are you on?

On Feb 1, 2013, at 11:18 PM, Jörg Prante joerg...@gmail.com wrote:

Yes, it looks like a crash or a split. The problematic indexes are
lost, and should be deleted. Otherwise, ES is not able to resolve the
conflict. Note, there are precautions against such node splits, did you
change the default settings in zen discovery, minimum_master_nodes for
example?

Best regards,

Jörg

Am 01.02.13 19:01, schrieb Amit Singh:

Hi All,

I am having an ES cluster with 2 nodes. I am not as to what caused
this issue;

Node 2-

[2] received shard failed for [TestDocTestDoc][2],
node[J368dRSdRxOUkTOEqIOsHg], [P], s[INITIALIZING], reason [Failed to start
shard, message [IndexShardGatewayRecoveryException[[TestDoc][2] shard
allocated for local recovery (post api), should exists, but doesn't]]]

[2013-02-01 16:37:41,667][WARN ][indices.cluster ] [2]
[TestDoc][2] failed to start shard

org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[TestDoc][2] shard allocated for local recovery (post api), should exists,
but doesn't

at
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:108)

at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)

at java.lang.Thread.run(Thread.java:636)
[2013-02-01 16:37:41,834][WARN ][cluster.action.shard ] [2]
sending failed shard for [TestDoc][2], node[J368dRSdRxOUkTOEqIOsHg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[TestDoc][2] shard allocated for local
recovery (post api), should exists, but doesn't]]]

[2013-02-01 16:37:41,834][WARN ][cluster.action.shard ] [2]
received shard failed for [TestDoc][2], node[J368dRSdRxOUkTOEqIOsHg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[TestDoc][2] shard allocated for local
recovery (post api), should exists, but doesn't]]]

[2013-02-01 16:39:00,306][WARN ][discovery.zen ] [2] master
should not receive new cluster state from
[[1][IlJPr1CBTmKxSgHyHJ7brg][inet[/10.190.209.134:9300]]]

Node1-

[2013-02-01 10:08:03,861][DEBUG][action.search.type ] [1] failed
to reduce search

org.elasticsearch.action.search.ReduceSearchPhaseException: Failed to
execute phase [fetch], [reduce]

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:177)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchQueryThenFetchAction.java:155)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$3.onResult(TransportSearchQueryThenFetchAction.java:1)

at
org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:345)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.executeFetch(TransportSearchQueryThenFetchAction.java:149)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2.run(TransportSearchQueryThenFetchAction.java:136)

at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ClassCastException:
org.elasticsearch.search.facet.termsstats.longs.InternalTermsStatsLongFacet
cannot be cast to
org.elasticsearch.plugin.multifssearch.InternalTermsStatsStringFacetMulti

at
org.elasticsearch.plugin.multifssearch.InternalTermsStatsStringFacetMulti.reduce(InternalTermsStatsStringFacetMulti.java:490)

at
org.elasticsearch.plugin.multifssearch.TermsStatsFacetProcessorMulti.reduce(TermsStatsFacetProcessorMulti.java:166)

at
org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhaseController.java:296)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.innerFinishHim(TransportSearchQueryThenFetchAction.java:190)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:175)

... 8 more
[2013-02-01 12:59:42,516][INFO ][cluster.metadata ] [1]
[TestDoc2] creating index, cause [auto(bulk api)], shards [5]/[0], mappings
[TestDoc2~type1]

[2013-02-01 13:00:34,555][INFO ][cluster.metadata ] [1]
[TestDoc3] creating index, cause [auto(bulk api)], shards [5]/[0], mappings
[TestDoc3~type1]

The Cluster health api from node1;

{
"cluster_name" : "test1",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 4256,
"active_shards" : 4256,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 4209
}

The cluster health API output from node2:
{
"cluster_name" : "test",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 8471,
"active_shards" : 8471,
"relocating_shards" : 0,
"initializing_shards" : 4,
"unassigned_shards" : 0
}

I looked through the ES group but could not find this exact issue.
It looks like one of the nodes (the master) left the cluster, probably
because of a network issue (I am not sure what the actual cause was, so
I am assuming a network issue). The second node then got elected as
master. When the network issue was resolved, the first node tried to
rejoin the cluster, which did happen, but the cluster state was
apparently not synced. Or were there two master nodes: master1 seeing
two nodes in the cluster but unable to communicate with the data node,
and master2 seeing only one node?

Please help me, as this is going over my head. I looked through
the different threads, but found nothing concrete.

Thanks in advance
Amit

You received this message because you are subscribed to the Google
Groups "elasticsearch" group.

To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.

For more options, visit https://groups.google.com/groups/opt_out.


Hi,

For two nodes you should have minimum_master_nodes=2, otherwise you can end
up with two 1-node clusters. And 2/2 + 1 = 2.
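To make the arithmetic concrete, here is a minimal sketch of that quorum formula (the function name is mine, and integer division is what makes 2/2 + 1 come out to 2):

```python
def minimum_master_nodes(n: int) -> int:
    """Quorum for n master-eligible nodes: N/2 + 1 with integer division."""
    return n // 2 + 1

# With 2 nodes the quorum is 2, not 1: a lone node cannot elect itself
# master, which is what prevents the split into two 1-node clusters.
print(minimum_master_nodes(2))  # 2
print(minimum_master_nodes(3))  # 2
print(minimum_master_nodes(5))  # 3
```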
On Feb 3, 2013 7:09 AM, "Amit Singh" amitsingh.kec@gmail.com wrote:

Thanks Jorg and kimchy,

I am using ES 0.19.4 version and minimum_master_nodes setting is default.
Since I have only two nodes in the cluster, I did not change
the minimum_master_nodes, as N/2+1 will give me value of 1 for two nodes.
Further when I restarted the cluster I am still getting the same error!

Please suggest.

Thanks
Amit

On Sunday, February 3, 2013 3:42:02 AM UTC+5:30, kimchy wrote:

Jorg mentioned the important minimum_master_nodes, which version of ES
are you on?

On Feb 1, 2013, at 11:18 PM, Jörg Prante joerg...@gmail.com wrote:

Yes, it looks like a crash or a split. The problematic indexes are
lost and should be deleted; otherwise ES is not able to resolve the
conflict. Note that there are precautions against such node splits: did you
change the default settings in zen discovery, minimum_master_nodes for
example?

Best regards,

Jörg

On 01.02.13 19:01, Amit Singh wrote:


Hi Radu,

Thanks for the response.

As per the suggestion, I have set "minimum_master_nodes" to 2 in the ES
config. And since it was time to add a new node, I added one more node to
the cluster (3 nodes in total), keeping "minimum_master_nodes" at 2.

All my nodes have the same configuration:
Java max heap: 12g; memory available on the system: 16g (AWS XLarge)
Data drives to hold the data: 4 drives, 500gb each.
1- When I restarted the cluster, a lot of shards got relocated to the new
node, and once the cluster health became stable, the new node (say node3)
was holding anywhere between 45-50% of the total data. I assumed the data
distribution among the nodes would be uniform?

2- When the system is creating new indexes, the average load on the new
node3 goes very high and is constantly between 100-200. When I compare the
size of the data across all the nodes, 85-90% of the new index data goes
to node3?

3- Will a change in the Lucene merge settings help or not? For example,
setting index.merge.policy.max_merge_at_once or
index.merge.policy.min_merge_size to a higher value.

Please help me understand the above.

I read through this post:
https://github.com/blog/1397-recent-code-search-outages

I am using ES 0.19.4 and Oracle Java 1.6.0_31. Is this the best
combination, or do I need to change the Java version to 7 or some other
version? I cannot change the ES version, as I have dependencies around it.

When I add a new node to a live cluster, what is the expected behavior of
ES? Since ES will be busy relocating shards, what will the impact be on
indexing new data and on searches against the existing data?

My apologies for the lengthy post; I did not expect it to be this long!

Thanks
Amit

On Tuesday, February 5, 2013 12:56:20 AM UTC+5:30, Radu Gheorghe wrote:


Hello Amit,

Normally, Elasticsearch tries to balance the number of shards across nodes.
It doesn't look at how much data is in these shards or which index the
shards belong to.

That might explain your situation, but I'm not sure. If it doesn't make
sense to you, please say some more about your index setup. Stuff like how
many indices you have, how many shards per index, which kind of documents
go in which index and what's the size of each shard.

The good news is that you can configure Elasticsearch to allocate shards in
various ways. Take a look at these links:
http://www.elasticsearch.org/guide/reference/index-modules/allocation.html
http://www.elasticsearch.org/guide/reference/modules/cluster.html
http://www.elasticsearch.org/guide/reference/api/admin-cluster-reroute.html

Although I think the last one is not available in 0.19.4.
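As a toy illustration of that point (this is not Elasticsearch's actual allocator, just a sketch of the idea): if each new shard simply goes to the node currently holding the fewest shards, a freshly added empty node soaks up nearly all new shards, regardless of how many bytes the other nodes hold.

```python
# Toy sketch of shard-count balancing: each new shard is placed on the
# node with the fewest shards, ignoring how big those shards are.
def allocate(shard_counts, new_shards):
    counts = dict(shard_counts)
    placements = []
    for _ in range(new_shards):
        target = min(counts, key=counts.get)  # least-loaded node by count
        counts[target] += 1
        placements.append(target)
    return counts, placements

# Two old nodes already hold 40 shards each; a freshly added node3 is empty.
counts, placements = allocate({"node1": 40, "node2": 40, "node3": 0}, 10)
# All 10 new shards land on node3 -- consistent with the observation that
# most new-index data ends up on the newest node.
print(placements.count("node3"))  # 10
```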

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Wed, Feb 13, 2013 at 10:56 AM, Amit Singh amitsingh.kec@gmail.com wrote:
