Cannot post new document to elasticsearch


(Gaurav Vijayvargiya) #1

Hi

We had a 5 node cluster (5 data nodes and 1 non-data master). One of
the data nodes went down. Now when we try to post anything to the
cluster it just hangs forever. There is nothing in the logs.

$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "merchant_service_test",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 20,
  "active_shards" : 38,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 0
}

$ curl -XPOST 'http://localhost:9200/m3/_optimize'
{"ok":true,"_shards":{"total":40,"successful":37,"failed":1,"failures":[
{"index":"m3","shard":1,"reason":"BroadcastShardOperationFailedException[[m3][1] ]; nested: RemoteTransportException[[Arkon][inet[/10.188.214.210:9300]][indices/optimize/shard]]; nested: FlushNotAllowedEngineException[[m3][1] Already flushing...]; "}]}}
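
For reference, the counters in the health output above can be cross-checked against the `_optimize` response with a few lines of Python. This is only a sketch: the field names are copied from the health response, while the helper name `shard_accounting` is made up for illustration, and the per-primary figure assumes every primary is active (which a "yellow" status implies).

```python
import json

# Health response copied from the curl output in this post.
HEALTH_JSON = """
{
  "cluster_name" : "merchant_service_test",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 20,
  "active_shards" : 38,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 0
}
"""


def shard_accounting(health):
    """Summarize shard copies from a _cluster/health response (hypothetical helper)."""
    active = health["active_shards"]
    initializing = health["initializing_shards"]
    unassigned = health["unassigned_shards"]
    primaries = health["active_primary_shards"]
    total = active + initializing + unassigned
    return {
        # Every shard copy the cluster knows about, started or not.
        "total_copies": total,
        # Copies that cannot serve requests yet.
        "not_started": initializing + unassigned,
        # Copies per shard, primary included (assumes all primaries are active).
        "copies_per_primary": total / primaries,
    }


summary = shard_accounting(json.loads(HEALTH_JSON))
print(summary)
```

With these numbers the total comes to 40 shard copies, which matches the `total` of 40 in the `_optimize` response, and works out to two copies per shard, with 2 copies still initializing.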

It's been like this for at least a few hours now. I don't see much CPU/IO
activity on any of the nodes.

Any suggestions?


(Gaurav Vijayvargiya) #2

Update,

When I try to do a GET on an existing document, I alternately get it
immediately or wait for a response forever.
My number of replicas is 2. My guess is that one of the replicas is
unavailable, so when the query goes to that replica, it hangs.
The next time, when the query goes to the other (good) replica, it
returns immediately.

A POST, on the other hand, needs to write to both replicas, and thus it
hangs.

How can I fix the "initializing shards" issue?

Gaurav


(Shay Banon) #3

Do you see any failures in the logs for those initializing shards?


(Gaurav Vijayvargiya) #4

Looked at the logs, there is nothing I could interpret.

Btw, both of the failing (initializing) shards were on the same node. I
shut down that node and the cluster recovered. So the good news is that
the cluster is up and running now, and so far all the data seems to be
there.

But I still don't know what caused the problem. :(
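
As a side note, the "which node holds the initializing shards" check can be sketched in Python. Everything here is illustrative: the sample data is invented, the node IDs are placeholders, and the layout (a `routing_table` section with per-shard lists of copies carrying `state`, `node`, and `shard` fields) is an assumption about what `/_cluster/state` returns, so it should be compared against real output before relying on it.

```python
from collections import defaultdict

# Made-up sample shaped like the "routing_table" section of a
# /_cluster/state response (assumed layout; node IDs are invented).
SAMPLE_STATE = {
    "routing_table": {
        "indices": {
            "m3": {
                "shards": {
                    "0": [
                        {"state": "STARTED", "primary": True, "node": "node-a", "shard": 0, "index": "m3"},
                        {"state": "INITIALIZING", "primary": False, "node": "node-b", "shard": 0, "index": "m3"},
                    ],
                    "1": [
                        {"state": "STARTED", "primary": True, "node": "node-c", "shard": 1, "index": "m3"},
                        {"state": "INITIALIZING", "primary": False, "node": "node-b", "shard": 1, "index": "m3"},
                    ],
                }
            }
        }
    }
}


def non_started_by_node(state):
    """Group shard copies that are not STARTED by the node they are assigned to."""
    stuck = defaultdict(list)
    for index_name, index in state["routing_table"]["indices"].items():
        for copies in index["shards"].values():
            for copy in copies:
                if copy["state"] != "STARTED":
                    stuck[copy["node"]].append((index_name, copy["shard"], copy["state"]))
    return dict(stuck)


stuck = non_started_by_node(SAMPLE_STATE)
print(stuck)
```

If all the non-started copies end up under a single node, as in the sample, that node is the obvious candidate to inspect or restart.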


(Shay Banon) #5

Strange..., which version are you using?


(Gaurav Vijayvargiya) #6

elasticsearch-0.18.7

