Help understanding "UnavailableShardsException" error


(Nikhil Singh) #1

Hi all,

We have a 5 node elasticsearch setup as follows:

  • 3 nodes running in data mode
  • 2 nodes running in master only mode

Each of the nodes has the following configuration options set in the
elasticsearch.yml file:

index.number_of_shards: 6
index.number_of_replicas: 0
index.mapper.dynamic: true
action.auto_create_index: true
action.disable_delete_all_indices: true

What I understand from the above config is that each index will have 6
shards and no replicas. Also, we have the following discovery options
set:

discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.101.55.93[9300-9400]",
"10.242.22.126[9300-9400]", "10.144.77.94[9300-9400]",
"10.116.173.148[9300-9400]", "10.224.42.205[9300-9400]"]

We are also running logstash which is indexing documents in this
elasticsearch setup. Yesterday, I started noticing the following errors in
the logstash indexer log file

{:timestamp=>"2014-03-10T11:05:44.453000+0000", :message=>"Failed to index
an event, will retry", :exception=>
org.elasticsearch.action.UnavailableShardsException:
[logstash-2014.03.10][1] [1] shardIt, [0] active : Timeout waiting for
[1m], request: index
{[logstash-2014.03.10][application][uMrD19E4QuKepoBq17ZKQA],
source[{"@source":"file://ip-10-118-115-235//mnt/deploy/apache-tomcat/logs/application.log","@tags":["shipped"],"@fields":{"environment":["production"],"service":["application"],"machine":["proxy1"],"timestamp":"03/10/14
06:03:07.292","thread":"http-8080-126","severity":"DEBUG","message":"ources.AccountServicesResource
idx=0"},"@timestamp":"2014-03-10T06:03:07.292Z","@source_host":"ip-10-118-115-235","@source_path":"//mnt/deploy/apache-tomcat/logs/application.log","@message":"03/10/14
06:03:07.292 [http-8080-126] DEBUG ources.AccountServicesResource
idx=0","@type":"application"}]},
:event=>{"@source"=>"file://ip-10-118-115-235//mnt/deploy/apache-tomcat/logs/application.log",
"@tags"=>["shipped"], "@fields"=>{"environment"=>["production"],
"service"=>["application"], "machine"=>["proxy1"], "timestamp"=>"03/10/14
06:03:07.292", "thread"=>"http-8080-126", "severity"=>"DEBUG",
"message"=>"ources.AccountServicesResource idx=0"},
"@timestamp"=>"2014-03-10T06:03:07.292Z",
"@source_host"=>"ip-10-118-115-235",
"@source_path"=>"//mnt/deploy/apache-tomcat/logs/application.log",
"@message"=>"03/10/14 06:03:07.292 [http-8080-126] DEBUG
ources.AccountServicesResource idx=0", "@type"=>"application"},
:level=>:warn}
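
For readers decoding the exception header, here is a minimal sketch that pulls the fields out of it. It assumes (this is an interpretation, not something stated in the thread) that "[1] shardIt" is the number of copies of shard 1 known to the routing table and "[0] active" is how many of them were started when the 1m wait expired. The pattern and function name are hypothetical and written only for this message shape.

```python
import re

# Header of the exception as it appears in the Logstash log above.
MESSAGE = (
    "org.elasticsearch.action.UnavailableShardsException: "
    "[logstash-2014.03.10][1] [1] shardIt, [0] active : "
    "Timeout waiting for [1m], request: index"
)

# Hypothetical pattern, written only for this one message shape.
PATTERN = re.compile(
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\] "
    r"\[(?P<copies>\d+)\] shardIt, \[(?P<active>\d+)\] active : "
    r"Timeout waiting for \[(?P<timeout>[^\]]+)\]"
)


def parse_unavailable_shards(message):
    """Return index, shard number, copy counts and timeout, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("shard", "copies", "active"):
        fields[key] = int(fields[key])
    return fields


info = parse_unavailable_shards(MESSAGE)
print(info)
```

Under that reading, the write failed because the only copy of shard 1 of logstash-2014.03.10 was not active for the whole minute, which points at shard allocation or recovery rather than at the document being indexed.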

While I was debugging this issue, I noticed that the first IP in
discovery.zen.ping.unicast.hosts was wrong. That IP did not point to any
of the 5 nodes in the cluster. I realized that the first IP should be
that of one of the nodes configured as master, so I changed it to the
correct one on all of the nodes and restarted ES. After that change, I no
longer see the above error.
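
A stale entry like this can be caught mechanically. The sketch below parses the host[low-high] entries from the list as posted and compares them with an inventory of node addresses. The inventory is hypothetical: it simply pretends the first address is the one that matched no node, and the function names are illustrative.

```python
import re

# The unicast host list as posted above.
UNICAST_HOSTS = [
    "10.101.55.93[9300-9400]",
    "10.242.22.126[9300-9400]",
    "10.144.77.94[9300-9400]",
    "10.116.173.148[9300-9400]",
    "10.224.42.205[9300-9400]",
]

# host, optionally followed by a [low-high] transport port range.
ENTRY = re.compile(r"^(?P<host>[^\[\]:]+)(?:\[(?P<low>\d+)-(?P<high>\d+)\])?$")


def parse_entry(entry):
    """Split 'host[low-high]' into (host, low, high); the range is optional."""
    match = ENTRY.match(entry)
    if match is None:
        raise ValueError("unrecognised unicast host entry: %r" % entry)
    low = int(match.group("low")) if match.group("low") else None
    high = int(match.group("high")) if match.group("high") else None
    return match.group("host"), low, high


def unknown_hosts(entries, known_nodes):
    """Return listed hosts that are missing from the node inventory."""
    hosts = [parse_entry(entry)[0] for entry in entries]
    return [host for host in hosts if host not in known_nodes]


# Hypothetical inventory: pretend the first listed address is not a node.
KNOWN_NODES = {"10.242.22.126", "10.144.77.94", "10.116.173.148", "10.224.42.205"}

stale = unknown_hosts(UNICAST_HOSTS, KNOWN_NODES)
print(stale)
```

In practice the inventory would come from wherever the node addresses are actually tracked; the point is only that the list can be checked before a restart.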

I have a question: since the first IP was wrong, the cluster would have
elected the only other node configured as master as the master. This
means that there was at least one master in the cluster. So, for this
exception to happen, could the other master's metadata about shards be
wrong?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d5eedac2-d152-4ab5-98fd-4cc22566aced%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Nikhil Singh) #2

I forgot to mention the ES version that we are running. It is 0.90.3


(Boaz Leskes) #3

Hi Nikhil,

The wrong IP in the unicast list is unrelated to the error. The unicast
list is only used when a node first starts up and needs to join the
cluster. Once that's done the list is not used anymore. It does mean that
the first master node (whose IP was wrong in the list) could have elected
itself as master with no one connecting to it (if it was the first to
start).
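
The isolated-master scenario above can be checked against the usual majority rule for discovery.zen.minimum_master_nodes, (master-eligible nodes / 2) + 1. This is a minimal sketch, assuming only the two master-only nodes in this cluster are master-eligible; the function names are illustrative and not part of any Elasticsearch API.

```python
def quorum(master_eligible_nodes):
    """Majority of master-eligible nodes: (n / 2) + 1 with integer division."""
    return master_eligible_nodes // 2 + 1


def can_split(master_eligible_nodes, minimum_master_nodes):
    """True if two disjoint groups of master-eligible nodes could each
    satisfy minimum_master_nodes and elect their own master."""
    return 2 * minimum_master_nodes <= master_eligible_nodes


# Assumption: 2 master-eligible nodes, minimum_master_nodes: 1 as posted.
print(quorum(2))        # majority for two master-eligible nodes
print(can_split(2, 1))  # whether each master could win an election alone
print(can_split(2, 2))  # same check with the majority value
```

With two master-eligible nodes a majority is both of them, so raising the setting to 2 removes the split but also means the cluster has no master whenever either one is down. A third master-eligible node is the common way around that trade-off.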

As to the error - I'm not sure exactly what caused the shards not to be
ready in time. Is there anything you see in the ES logs? Also, does the
indexing succeed once retried? Could it be that this is always the first
message in its respective index?

Cheers,
Boaz


(Nikhil Singh) #4

Hi Boaz,

Following are the errors that I am seeing in ES logs
[2014-03-19 02:42:07,067][WARN ][cluster.action.shard ] [kibby-master] received shard failed for [logstash-2014.03.15][0], node[Iq16vnyfRkyZxM2WzN0X_g], [R], s[INITIALIZING], reason [Failed to start shard, message [RecoveryFailedException[[logstash-2014.03.15][0]: Recovery failed from [kibby0][uCgP657HTjuSqtxIRuZc-A][inet[/10.144.77.94:9300]]{max_local_storage_nodes=1, master=false} into [kibby2][Iq16vnyfRkyZxM2WzN0X_g][inet[/10.224.42.205:9300]]{max_local_storage_nodes=1, master=false}]; nested: RemoteTransportException[[kibby0][inet[/10.144.77.94:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[logstash-2014.03.15][0] Phase[3] Execution failed]; nested: ReceiveTimeoutTransportException[[kibby2][inet[/10.224.42.205:9300]][index/shard/recovery/finalize] request_id [117616621] timed out after [1800000ms]]; ]]
[2014-03-19 02:57:07,126][WARN ][cluster.action.shard ] [kibby-master] received shard failed for [logstash-2014.03.14][0], node[Iq16vnyfRkyZxM2WzN0X_g], [R], s[INITIALIZING], reason [Failed to start shard, message [RecoveryFailedException[[logstash-2014.03.14][0]: Recovery failed from [kibby0][uCgP657HTjuSqtxIRuZc-A][inet[/10.144.77.94:9300]]{max_local_storage_nodes=1, master=false} into [kibby2][Iq16vnyfRkyZxM2WzN0X_g][inet[/10.224.42.205:9300]]{max_local_storage_nodes=1, master=false}]; nested: RemoteTransportException[[kibby0][inet[/10.144.77.94:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[logstash-2014.03.14][0] Phase[1] Execution failed]; nested: RecoverFilesRecoveryException[[logstash-2014.03.14][0] Failed to transfer [1] files with total size of [2.2gb]]; nested: ReceiveTimeoutTransportException[[kibby2][inet[/10.224.42.205:9300]][index/shard/recovery/cleanFiles] request_id [117618431] timed out after [900000ms]]; ]]
[2014-03-19 11:34:45,525][WARN ][cluster.action.shard ] [kibby-master] received shard failed for [logstash-2014.03.17][4], node[Iq16vnyfRkyZxM2WzN0X_g], [R], s[INITIALIZING], reason [Failed to start shard, message [RecoveryFailedException[[logstash-2014.03.17][4]: Recovery failed from [kibby0][uCgP657HTjuSqtxIRuZc-A][inet[/10.144.77.94:9300]]{max_local_storage_nodes=1, master=false} into [kibby2][Iq16vnyfRkyZxM2WzN0X_g][inet[/10.224.42.205:9300]]{max_local_storage_nodes=1, master=false}]; nested: RemoteTransportException[[kibby0][inet[/10.144.77.94:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[logstash-2014.03.17][4] Phase[3] Execution failed]; nested: ReceiveTimeoutTransportException[[kibby2][inet[/10.224.42.205:9300]][index/shard/recovery/finalize] request_id [117648738] timed out after [1800000ms]]; ]]
[2014-03-19 11:49:46,417][WARN ][cluster.action.shard ] [kibby-master] received shard failed for [logstash-2014.03.15][5], node[Iq16vnyfRkyZxM2WzN0X_g], [R], s[INITIALIZING], reason [Failed to start shard, message [RecoveryFailedException[[logstash-2014.03.15][5]: Recovery failed from [kibby0][uCgP657HTjuSqtxIRuZc-A][inet[/10.144.77.94:9300]]{max_local_storage_nodes=1, master=false} into [kibby2][Iq16vnyfRkyZxM2WzN0X_g][inet[/10.224.42.205:9300]]{max_local_storage_nodes=1, master=false}]; nested: RemoteTransportException[[kibby0][inet[/10.144.77.94:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[logstash-2014.03.15][5] Phase[1] Execution failed]; nested: RecoverFilesRecoveryException[[logstash-2014.03.15][5] Failed to transfer [1] files with total size of [1.7gb]]; nested: ReceiveTimeoutTransportException[[kibby2][inet[/10.224.42.205:9300]][index/shard/recovery/cleanFiles] request_id [117650548] timed out after [900000ms]]; ]]

Now that I look at the logs carefully, it seems that the shards are
large: 1.7GB to 2.2GB in some cases. I am guessing that ES is trying to
move shards around and that the transfers are timing out. I am not sure
whether these errors would actually cause the shard to be unavailable.
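
To compare the failures side by side, here is a rough sketch that reduces one such "received shard failed" entry to the shard, the recovery phase, the timed-out action, and the timeout in minutes. The sample is the 02:57:07 entry joined onto one line; the regular expressions and the function name are hypothetical and fitted only to these messages.

```python
import re

# The 02:57:07 entry from the ES log, joined onto a single line.
ENTRY = (
    "[2014-03-19 02:57:07,126][WARN ][cluster.action.shard ] [kibby-master] "
    "received shard failed for [logstash-2014.03.14][0], "
    "node[Iq16vnyfRkyZxM2WzN0X_g], [R], s[INITIALIZING], reason [Failed to "
    "start shard, message [RecoveryFailedException[[logstash-2014.03.14][0]: "
    "Recovery failed from [kibby0][uCgP657HTjuSqtxIRuZc-A]"
    "[inet[/10.144.77.94:9300]]{max_local_storage_nodes=1, master=false} "
    "into [kibby2][Iq16vnyfRkyZxM2WzN0X_g][inet[/10.224.42.205:9300]]"
    "{max_local_storage_nodes=1, master=false}]; nested: "
    "RemoteTransportException[[kibby0][inet[/10.144.77.94:9300]]"
    "[index/shard/recovery/startRecovery]]; nested: "
    "RecoveryEngineException[[logstash-2014.03.14][0] Phase[1] Execution "
    "failed]; nested: RecoverFilesRecoveryException[[logstash-2014.03.14][0] "
    "Failed to transfer [1] files with total size of [2.2gb]]; nested: "
    "ReceiveTimeoutTransportException[[kibby2][inet[/10.224.42.205:9300]]"
    "[index/shard/recovery/cleanFiles] request_id [117618431] timed out "
    "after [900000ms]]; ]]"
)

SHARD = re.compile(r"received shard failed for \[([^\]]+)\]\[(\d+)\]")
PHASE = re.compile(r"Phase\[(\d+)\] Execution failed")
TIMEOUT = re.compile(
    r"\[(index/shard/recovery/\w+)\] request_id \[\d+\] "
    r"timed out after \[(\d+)ms\]"
)
SIZE = re.compile(r"total size of \[([^\]]+)\]")


def summarize(entry):
    """Reduce one 'received shard failed' entry to its key fields."""
    shard = SHARD.search(entry)
    phase = PHASE.search(entry)
    timeout = TIMEOUT.search(entry)
    size = SIZE.search(entry)
    return {
        "index": shard.group(1) if shard else None,
        "shard": int(shard.group(2)) if shard else None,
        "phase": int(phase.group(1)) if phase else None,
        "action": timeout.group(1) if timeout else None,
        "timeout_minutes": int(timeout.group(2)) / 60000.0 if timeout else None,
        "size": size.group(1) if size else None,  # only present in some entries
    }


summary = summarize(ENTRY)
print(summary)
```

Run over all four entries, this shows that every failure is a recovery from kibby0 into kibby2 that ends in a transport timeout (900000ms is 15 minutes, 1800000ms is 30 minutes), which suggests looking at kibby2 and the link between the two nodes.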

