How to restart/recover a shard?


(Kenneth Loafman) #1

Hi,

The second shard on one of my indexes has failed due to:
[05:59:47,332][WARN ][index.gateway ] [Mangog] [twitter][1]
failed to snapshot on close
...followed by a long traceback.
...followed by:
[05:59:49,336][WARN ][cluster.action.shard ] [Mangog] received shard
failed for [twitter][1], node[86d601df-e124-45ed-a5f2-57d762042d87],
[P], s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][1] Failed to recovery
translog]; nested: EngineCreationFailureException[[twitter][1] Failed to
open reader on writer]; nested:
FileNotFoundException[/mnt/search-data-dev/elasticsearch/nodes/0/indices/twitter/1/index/_d8g.cfs
(No such file or directory)]; ]]

Is the recovery process automatic, or do I have to do something
special? It appears to be just this one shard.

I use the service wrapper to start/stop 0.9.1-SNAPSHOT, and my config is
below.

...Thanks,
...Ken

cloud:
   aws:
       access_key: *****
       secret_key: *****

gateway:
   type: s3
   s3:
       bucket: *****

path :
   work : /mnt/search-data-dev
   logs : /mnt/search-data-dev/node1/logs

index :
   number_of_shards : 2
   number_of_replicas : 1

network :
   host : 192.168.1.5


(Shay Banon) #2

The shard will be allocated to another node and recovered there. Do you see it
happening continuously?

-shay.banon

On Thu, Aug 19, 2010 at 2:28 PM, Kenneth Loafman
<kenneth.loafman@gmail.com> wrote:


(Kenneth Loafman) #3

No, this is the first time. The shutdown took a while, with several
'Waiting for not to shutdown...' style messages. It came up bad after that.

So, if I have two nodes now, and one needs to be recovered, I'll need 3
nodes to get the recovery done?

...Ken


(Shay Banon) #4

It should be allocated on the other node; you shouldn't need to start
another node. When you issue a cluster health request (a simple curl will do),
what is the status? The cluster state API gives you more information if you
are after it (each shard and its state).
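
For anyone who would rather script these two checks than run curl by hand, here is a minimal Python sketch. It assumes the node's HTTP API is reachable on port 9200 (the usual default) at the `network.host` address from the config above, and that the health response carries a top-level `status` field. The helper names (`cluster_url`, `fetch_json`, `print_cluster_status`) are made up for illustration and are not part of any Elasticsearch client.

```python
import json
from urllib.request import urlopen

# Assumption: HTTP API on the default port 9200, at the network.host
# address quoted in the config earlier in this thread.
BASE_URL = "http://192.168.1.5:9200"


def cluster_url(endpoint, base_url=BASE_URL):
    """Build the URL for a cluster-level endpoint such as 'health' or 'state'."""
    return f"{base_url.rstrip('/')}/_cluster/{endpoint}"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (needs a reachable node)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def print_cluster_status():
    """Print the overall health status, then the full cluster state."""
    health = fetch_json(cluster_url("health"))
    # Assumption: the health response has a top-level "status" field.
    print("cluster status:", health.get("status"))
    state = fetch_json(cluster_url("state"))
    print(json.dumps(state, indent=2))
```

Calling `print_cluster_status()` against a live node should show the overall status first and then the per-shard routing details from the cluster state.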

On Thu, Aug 19, 2010 at 3:48 PM, Kenneth Loafman
<kenneth.loafman@gmail.com> wrote:


(Kenneth Loafman) #5

It seems to have started recovery, but it's been 7.5 hours and it appears to
be stopped/hung...

            "1": [
                {
                    "gateway_recovery": {
                        "index": {
                            "expected_recovered_size": "0b", 
                            "expected_recovered_size_in_bytes": 0, 
                            "recovered_size": "0b", 
                            "recovered_size_in_bytes": 0, 
                            "reused_size": "0b", 
                            "reused_size_in_bytes": 0, 
                            "size": "0b", 
                            "size_in_bytes": 0, 
                            "throttling_time": "0s", 
                            "throttling_time_in_millis": 0
                        }, 
                        "stage": "RETRY", 
                        "start_time_in_millis": 1282226019603, 
                        "throttling_time": "7.6h", 
                        "throttling_time_in_millis": 27514627, 
                        "time": "7.6h", 
                        "time_in_millis": 27514657, 
                        "translog": {
                            "recovered": 0
                        }
                    }, 
                    "index": {
                        "size": "0b", 
                        "size_in_bytes": 0
                    }, 
                    "routing": {
                        "index": "twitter", 
                        "node": "031642a1-968f-40fb-b7c2-5a869769d5b4", 
                        "primary": true, 
                        "relocating_node": null, 
                        "shard": 1, 
                        "state": "INITIALIZING"
                    }, 
                    "state": "RECOVERING"
                }
            ]
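
A status fragment like the one above can be checked for stalled recoveries with a few lines of Python. This is only a sketch: it assumes the shard section is a mapping from shard number to a list of shard copies, exactly as in the fragment, and it reads only the field names visible there. The one-hour threshold and the function name `find_stalled_recoveries` are arbitrary choices for illustration.

```python
import json

# The fragment posted above, reduced to the fields this sketch reads.
SHARDS_FRAGMENT = json.loads("""
{
  "1": [
    {
      "gateway_recovery": {
        "index": {"recovered_size_in_bytes": 0},
        "stage": "RETRY",
        "time_in_millis": 27514657,
        "translog": {"recovered": 0}
      },
      "routing": {
        "index": "twitter",
        "node": "031642a1-968f-40fb-b7c2-5a869769d5b4",
        "primary": true,
        "shard": 1,
        "state": "INITIALIZING"
      },
      "state": "RECOVERING"
    }
  ]
}
""")

# Hypothetical threshold: treat a recovery as suspicious after one hour.
STALL_AFTER_MILLIS = 60 * 60 * 1000


def find_stalled_recoveries(shards, stall_after_millis=STALL_AFTER_MILLIS):
    """Return (index, shard, node, hours) for recoveries with no progress."""
    stalled = []
    for copies in shards.values():
        for copy in copies:
            recovery = copy.get("gateway_recovery")
            if copy.get("state") != "RECOVERING" or recovery is None:
                continue
            recovered_bytes = recovery.get("index", {}).get("recovered_size_in_bytes", 0)
            recovered_ops = recovery.get("translog", {}).get("recovered", 0)
            elapsed = recovery.get("time_in_millis", 0)
            # Nothing recovered from the index or the translog after a long wait.
            if recovered_bytes == 0 and recovered_ops == 0 and elapsed >= stall_after_millis:
                routing = copy["routing"]
                hours = round(elapsed / 3_600_000, 1)
                stalled.append((routing["index"], routing["shard"], routing["node"], hours))
    return stalled


print(find_stalled_recoveries(SHARDS_FRAGMENT))
```

For the fragment above this reports shard 1 of `twitter` on node `031642a1-...` with about 7.6 hours elapsed and nothing recovered, which matches the "time" value in the status output.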


(Shay Banon) #6

Do you still have it running? Can you gist the cluster state and the index
status results?

I see that you are using master. I have fixed several things in this area;
can you pull a new version?

-shay.banon

On Fri, Aug 20, 2010 at 12:33 AM, Kenneth Loafman
<kenneth.loafman@gmail.com> wrote:


(Kenneth Loafman) #7

I upgraded to last night's version, restarted, and things are worse. Now
I have 5 shards hung in recovery, not all on the same node. Weird.

I've attached the info you want. I'll leave things running for now.

...Thanks,
...Ken


(Kenneth Loafman) #8

Attachments did not make it. See:
http://pastebin.com/ziALRgx5 -- cluster state
http://pastebin.com/63Xm95xM -- index status

Sorry, they lost their formatting on Pastebin.

...Ken


(Shay Banon) #9

Just pushed a fix for this.

On Fri, Aug 20, 2010 at 3:31 PM, Kenneth Loafman
<kenneth.loafman@gmail.com> wrote:


(Shay Banon) #10

... can you test?


(Kenneth Loafman) #11

Will do so in just a bit...


(Kenneth Loafman) #12

I restarted and now 35 of 36 are successful, but if you look at the
status, it's showing multiple shards in recovery. I'm confused.

See cluster status in http://pastebin.com/9qWLf3mk


(Shay Banon) #13

The top section just states which shards were queried; a shard that is
still not allocated will obviously not be queried. It seems like it's still
in the recovery process. Besides the high-level health API, there are two
main APIs for really understanding what is going on: the cluster state API,
which shows the cluster-wide state (where each shard is supposed to be and
what its state is), and the status API, which gives detailed information on
the status of each shard allocated on each node.
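
As a rough illustration of reading the status API output, the sketch below walks a shard map shaped like the fragment posted earlier in this thread (a shard id mapped to a list of shard copies, each with `state`, `routing`, and `gateway_recovery`) and lists the copies that are not yet started. Only the field names come from the posted output; the function name, the `STARTED` value used for a healthy copy, and the sample data are assumptions for illustration.

```python
def unstarted_shards(shards):
    """Return (shard_id, node, state, stage) for every shard copy not STARTED.

    `shards` maps a shard id to a list of shard copies, mirroring the
    fragment of status output posted in this thread.
    """
    pending = []
    for shard_id, copies in sorted(shards.items()):
        for copy in copies:
            state = copy.get("state")
            if state == "STARTED":  # assumed name of the healthy state
                continue
            routing = copy.get("routing", {})
            recovery = copy.get("gateway_recovery", {})
            pending.append(
                (shard_id, routing.get("node"), state, recovery.get("stage"))
            )
    return pending


# Hypothetical sample: shard 1 mirrors the posted fragment, shard 0 is made up.
sample = {
    "0": [{"state": "STARTED", "routing": {"node": "node-a", "primary": True}}],
    "1": [{
        "state": "RECOVERING",
        "routing": {
            "node": "031642a1-968f-40fb-b7c2-5a869769d5b4",
            "primary": True,
        },
        "gateway_recovery": {"stage": "RETRY"},
    }],
}

for entry in unstarted_shards(sample):
    print(entry)
```

Feeding it the shard map from a real status response would give a short list of the shards worth looking at, rather than the whole document.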

Is the recovery progressing?
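
One way to answer that question is to take two status snapshots of the same shard copy a few minutes apart and compare the recovery counters. The sketch below does this with the `recovered_size_in_bytes` and translog `recovered` fields seen in the output posted in this thread. The function names and the second snapshot's numbers are hypothetical, and treating "any counter increased" as progress is an assumption, not a rule taken from Elasticsearch.

```python
def recovery_counters(copy):
    """Extract (recovered bytes, recovered translog ops) from one shard copy."""
    recovery = copy.get("gateway_recovery", {})
    recovered_bytes = recovery.get("index", {}).get("recovered_size_in_bytes", 0)
    translog_ops = recovery.get("translog", {}).get("recovered", 0)
    return recovered_bytes, translog_ops


def is_progressing(before, after):
    """True if either counter grew between the two snapshots (assumed rule)."""
    before_bytes, before_ops = recovery_counters(before)
    after_bytes, after_ops = recovery_counters(after)
    return after_bytes > before_bytes or after_ops > before_ops


# First snapshot uses the values from the posted status fragment.
stalled = {
    "gateway_recovery": {
        "index": {"recovered_size_in_bytes": 0},
        "stage": "RETRY",
        "translog": {"recovered": 0},
    }
}
# Hypothetical later snapshot in which some index data has been recovered.
moving = {
    "gateway_recovery": {
        "index": {"recovered_size_in_bytes": 1024},
        "translog": {"recovered": 0},
    }
}

print(is_progressing(stalled, stalled))
print(is_progressing(stalled, moving))
```

A shard that sits in the `RETRY` stage with both counters at zero across snapshots, as in the fragment posted above, would show up here as not progressing.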

p.s. Can you use gist instead of pastebin?

-shay.banon


(Kenneth Loafman) #14

I think so... Here's the latest on gist http://gist.github.com/540471

Thanks for the pointer on gist, I've never used it before.


(Shay Banon) #15

Great, ping me if it does not finish; I am here to help (we can make it more
interactive on IRC).

P.S. Can you keep the original JSON format when you gist? It is much easier to
see what's going on. You can add pretty=true as a parameter to get it
pretty-printed.
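
For anyone following along, here is a minimal sketch of fetching the status with pretty=true so it can be pasted into a gist unchanged. The host and port are the ones used in this thread; `pretty_url` and `es_get` are hypothetical helper names for illustration, not part of Elasticsearch.

```shell
#!/bin/sh
# Base address of one node (assumption: the host from this thread, default port).
ES_URL="${ES_URL:-http://192.168.1.5:9200}"

# Build a request URL with pretty=true appended.
# Uses & when the path already carries a query string, ? otherwise.
pretty_url() {
    case "$1" in
        *\?*) printf '%s%s&pretty=true\n' "$ES_URL" "$1" ;;
        *)    printf '%s%s?pretty=true\n' "$ES_URL" "$1" ;;
    esac
}

# Fetch a path as pretty-printed JSON (makes a network request when called).
es_get() {
    curl -s -XGET "$(pretty_url "$1")"
}
```

Something like `es_get /twitter/_status > status.json` would then save the index status in its original JSON layout, ready to paste.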

-shay.banon


(Kenneth Loafman) #16

Now it's looping: progress goes to 100, then starts over.

I set up a once-per-second loop using:
while /bin/true; do date; curl -XGET 'http://192.168.1.5:9200/twitter/_status?pretty=true'; sleep 1; done
then copied the output to a gist: http://gist.github.com/540711

It should have recovered by now, I would think.
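
A quieter variant of that loop, as a rough sketch: it keeps only the "stage" and "state" lines so the restart is easier to spot. It assumes the pretty-printed output puts one key per line; `recovery_lines` and `watch_recovery` are made-up helper names, not anything Elasticsearch provides.

```shell
#!/bin/sh
# Keep only the "stage" and "state" lines from pretty-printed status JSON
# read on stdin, and strip their leading indentation.
recovery_lines() {
    grep -E '"(stage|state)"[[:space:]]*:' | sed 's/^[[:space:]]*//'
}

# Poll the given status URL once a second, printing a timestamp followed
# by the filtered lines. Runs until interrupted (Ctrl-C).
watch_recovery() {
    while true; do
        date
        curl -s -XGET "$1" | recovery_lines
        sleep 1
    done
}
```

It could be started with `watch_recovery 'http://192.168.1.5:9200/twitter/_status?pretty=true'`.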

...Ken

>     >>         >>     >                 ]
>     >>         >>
>     >>         >>
>     >>         >>     Shay Banon wrote:
>     >>         >>     > It should be allocated on the other node, you
>     shouldn't
>     >>         need to start
>     >>         >>     > another node. When you issue a cluster health
>     (simple
>     >>         curl can
>     >>         >>     do), what
>     >>         >>     > is the status? The cluster state API gives
you more
>     >>         information if you
>     >>         >>     > are after (each shard and its state).
>     >>         >>     >
>     >>         >>     > On Thu, Aug 19, 2010 at 3:48 PM, Kenneth
Loafman
>     >>         >>     > <kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>>
>     >>         >>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>
>     >>         >>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>>>> wrote:
>     >>         >>     >
>     >>         >>     >     No this is the first time.  The
shutdown took a
>     >>         while with several
>     >>         >>     >     'Waiting for not to shutdown..." style
>     message.  It
>     >>         came up bad
>     >>         >>     >     after that.
>     >>         >>     >
>     >>         >>     >     So, if I have two nodes now, and one
needs to be
>     >>         recovered,
>     >>         >>     I'll need 3
>     >>         >>     >     nodes to get the recovery done?
>     >>         >>     >
>     >>         >>     >     ...Ken
>     >>         >>     >
>     >>         >>     >     Shay Banon wrote:
>     >>         >>     >     > The shard will allocated to another
node and
>     >>         recovered there. Do
>     >>         >>     >     you see
>     >>         >>     >     > it happen continuously?
>     >>         >>     >     >
>     >>         >>     >     > -shay.banon
>     >>         >>     >     >
>     >>         >>     >     > On Thu, Aug 19, 2010 at 2:28 PM, Kenneth
>     Loafman
>     >>         >>     >     > <kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>
>     >>         >>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
<mailto:kenneth.loafman@gmail.com <mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>
>     >>         >>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>>>
>     >>         >>     >     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>
>     >>         >>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>>
>     >>         >>     >     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>
>     >>         >>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>
>     >>         <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>>>>> wrote:
>     >>         >>     >     >
>     >>         >>     >     >     Hi,
>     >>         >>     >     >
>     >>         >>     >     >     The second shard on one of my
indexes has
>     >>         failed due to:
>     >>         >>     >     >     [05:59:47,332][WARN ][index.gateway
>     >>          ] [Mangog]
>     >>         >>     >     [twitter][1]
>     >>         >>     >     >     failed to snapshot on close
>     >>         >>     >     >     ...followed by a long traceback.
>     >>         >>     >     >     ...followed by:
>     >>         >>     >     >     [05:59:49,336][WARN
][cluster.action.shard
>     >>           ] [Mangog]
>     >>         >>     >     received shard
>     >>         >>     >     >     failed for [twitter][1],
>     >>         >>     >    
node[86d601df-e124-45ed-a5f2-57d762042d87],
>     >>         >>     >     >     [P], s[INITIALIZING], reason [Failed
>     to start
>     >>         shard, message
>     >>         >>     >     >
>     >>         [IndexShardGatewayRecoveryException[[twitter][1]
Failed to
>     >>         >>     >     recovery
>     >>         >>     >     >     translog]; nested:
>     >>         >>     EngineCreationFailureException[[twitter][1]
>     >>         >>     >     Failed to
>     >>         >>     >     >     open reader on writer]; nested:
>     >>         >>     >     >
>     >>         >>     >
>     >>         >>
>     >>
>    
FileNotFoundException[/mnt/search-data-dev/elasticsearch/nodes/0/indices/twitter/1/index/_d8g.cfs
>     >>         >>     >     >     (No such file or directory)]; ]]
>     >>         >>     >     >
>     >>         >>     >     >     Is the recovery process
automatic, or do I
>     >>         have to do
>     >>         >>     something
>     >>         >>     >     >     special?  It appears to be just this
>     one shard.
>     >>         >>     >     >
>     >>         >>     >     >     I use the service wrapper to
start/stop
>     >>         0.9.1-SNAPSHOT,
>     >>         >>     and my
>     >>         >>     >     config is
>     >>         >>     >     >     below.
>     >>         >>     >     >
>     >>         >>     >     >     ...Thanks,
>     >>         >>     >     >     ...Ken
>     >>         >>     >     >
>     >>         >>     >     >     cloud:
>     >>         >>     >     >        aws:
>     >>         >>     >     >            access_key: *****
>     >>         >>     >     >            secret_key: *****
>     >>         >>     >     >
>     >>         >>     >     >     gateway:
>     >>         >>     >     >        type: s3
>     >>         >>     >     >        s3:
>     >>         >>     >     >            bucket: *****
>     >>         >>     >     >
>     >>         >>     >     >     path :
>     >>         >>     >     >        work : /mnt/search-data-dev
>     >>         >>     >     >        logs :
/mnt/search-data-dev/node1/logs
>     >>         >>     >     >
>     >>         >>     >     >     index :
>     >>         >>     >     >        number_of_shards : 2
>     >>         >>     >     >        number_of_replicas : 1
>     >>         >>     >     >
>     >>         >>     >     >     network :
>     >>         >>     >     >        host : 192.168.1.5
>     >>         >>     >     >
>     >>         >>     >     >
>     >>         >>     >
>     >>         >>     >
>     >>         >>
>     >>         >>
>     >>         >
>     >>
>     >>
>     >>
>     >
>
>

(Shay Banon) #17

Do you see any exceptions in the logs (failing to start the shard)?

On Fri, Aug 20, 2010 at 8:02 PM, Kenneth Loafman
kenneth.loafman@gmail.com wrote:

Now it's looping: progress goes to 100, then starts over.

I set up a 1/second loop using:
while /bin/true; do date; curl -XGET 'http://192.168.1.5:9200/twitter/_status?pretty=true'; sleep 1; done
then copied it to gist at: http://gist.github.com/540711

It should have recovered by now, I would think.

...Ken
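
A minimal sketch, in Python, of how output from the polling loop above could be reduced to a per-shard summary instead of being read by eye. It assumes the `shards` mapping has the shape quoted earlier in this thread (shard id mapped to a list of shard copies, each carrying `gateway_recovery`, `routing`, and `state`). The function names `summarize_recovery` and `still_recovering` are hypothetical, and since the thread does not show where that mapping sits inside the full `_status` response, extracting it is left to the caller.

```python
# Sample data copied from the status output quoted in this thread.
SAMPLE_SHARDS = {
    "1": [
        {
            "gateway_recovery": {
                "index": {
                    "expected_recovered_size_in_bytes": 0,
                    "recovered_size_in_bytes": 0,
                },
                "stage": "RETRY",
                "time_in_millis": 27514657,
            },
            "routing": {
                "index": "twitter",
                "node": "031642a1-968f-40fb-b7c2-5a869769d5b4",
                "primary": True,
                "shard": 1,
                "state": "INITIALIZING",
            },
            "state": "RECOVERING",
        }
    ]
}


def summarize_recovery(shards):
    """Return one summary dict per shard copy in the given shards mapping."""
    rows = []
    for shard_id, copies in sorted(shards.items()):
        for copy in copies:
            recovery = copy.get("gateway_recovery", {})
            index = recovery.get("index", {})
            rows.append({
                "shard": shard_id,
                "node": copy.get("routing", {}).get("node"),
                "state": copy.get("state"),
                "stage": recovery.get("stage"),
                "recovered_bytes": index.get("recovered_size_in_bytes"),
                "expected_bytes": index.get("expected_recovered_size_in_bytes"),
            })
    return rows


def still_recovering(rows):
    """Return the ids of shard copies whose state is still RECOVERING."""
    return [row["shard"] for row in rows if row["state"] == "RECOVERING"]


if __name__ == "__main__":
    print(still_recovering(summarize_recovery(SAMPLE_SHARDS)))  # ['1']
```

Comparing `recovered_bytes` and `stage` across successive polls would show whether a shard is making progress or repeatedly restarting, which is the looping behaviour described above.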


(Shay Banon) #18

Also, use the latest version again; I pushed some more fixes.

On Fri, Aug 20, 2010 at 8:04 PM, Shay Banon shay.banon@elasticsearch.comwrote:

Do you see any exceptions in the logs (failing to start the shard)?

On Fri, Aug 20, 2010 at 8:02 PM, Kenneth Loafman <
kenneth.loafman@gmail.com> wrote:

Now its looping: progress is going to 100, then starting over.

I set up a 1/second loop using:
while /bin/true; do date; curl -XGET
'http://192.168.1.5:9200/twitter/_status?pretty=true'; sleep 1; done
then copied it to gist at: http://gist.github.com/540711

It should have recovered by now, I would think.

...Ken

Shay Banon wrote:

great, ping me if it does not end, I am here to help (we can make it
more interactive on IRC).

p.s. Can you keep the original json format when you gist? Much easier to
know whats going on. You can add pretty=true as a parameter to get it
pretty printed.

-shay.banon

On Fri, Aug 20, 2010 at 5:51 PM, Kenneth Loafman
<kenneth.loafman@gmail.com mailto:kenneth.loafman@gmail.com> wrote:

I think so... Here's the latest on gist

http://gist.github.com/540471

Thanks for the pointer on gist, I've never used it before.

Shay Banon wrote:
> The top just states which shards were queries, a shard that is
still not
> allocated will obviously not be allocated. It seems like its still

in

> recovery process. There are two main APIs to really understand

what is

> going on (except for the high level health api), the cluster state
API,
> that shows you what the cluster wide state is (where each shard is
> supposed to be, what its state is), and the status api which gives

you

> detailed information of the status of each shard allocated on each
node.
>
> Is the recovery progressing?
>
> p.s. Can you use gist instead of pastebin?
>
> -shay.banon
>
> On Fri, Aug 20, 2010 at 5:13 PM, Kenneth Loafman
> <kenneth.loafman@gmail.com <mailto:kenneth.loafman@gmail.com>
<mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>> wrote:
>
>     I restarted and now 35 of 36 are successful, but if you look
at the
>     status, it's showing multiple shards in recovery.  I'm

confused.

>
>     See cluster status in http://pastebin.com/9qWLf3mk
>
>     Kenneth Loafman wrote:
>     > Will do so in just a bit...
>     >
>     > Shay Banon wrote:
>     >> ... can you test?
>     >>
>     >> On Fri, Aug 20, 2010 at 4:02 PM, Shay Banon
>     >> <shay.banon@elasticsearch.com
<mailto:shay.banon@elasticsearch.com>
>     <mailto:shay.banon@elasticsearch.com
<mailto:shay.banon@elasticsearch.com>>
>     <mailto:shay.banon@elasticsearch.com
<mailto:shay.banon@elasticsearch.com>
>     <mailto:shay.banon@elasticsearch.com
<mailto:shay.banon@elasticsearch.com>>>> wrote:
>     >>
>     >>     Just pushed a fix for this.
>     >>
>     >>
>     >>     On Fri, Aug 20, 2010 at 3:31 PM, Kenneth Loafman
>     >>     <kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com> <mailto:

kenneth.loafman@gmail.com

<mailto:kenneth.loafman@gmail.com>>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>
>     <mailto:kenneth.loafman@gmail.com
<mailto:kenneth.loafman@gmail.com>>>> wrote:
>     >>
>     >>         Attachments did not make it.  See:
>     >>         http://pastebin.com/ziALRgx5 -- cluster state
>     >>         http://pastebin.com/63Xm95xM -- index status
>     >>
>     >>         Sorry, they lost their formatting on Pastbin.
>     >>
>     >>         ...Ken
>     >>
>     >>         Kenneth Loafman wrote:
>     >>         > I upgraded to last night's version, restarted, and things are
>     >>         > worse.  Now I have 5 shards hung in recovery, not all on the
>     >>         > same node.  Weird.
>     >>         >
>     >>         > I've attached the info you want.  I'll leave things running
>     >>         > for now.
>     >>         >
>     >>         > ...Thanks,
>     >>         > ...Ken
>     >>         >
>     >>         > Shay Banon wrote:
>     >>         >> Do you still have it running? Can you gist the cluster state
>     >>         >> and the index status results?
>     >>         >>
>     >>         >> I see that you are using master. I have fixed several things
>     >>         >> in this area; can you pull a new version?
>     >>         >>
>     >>         >> -shay.banon
>     >>         >>
>     >>         >> On Fri, Aug 20, 2010 at 12:33 AM, Kenneth Loafman
>     >>         >> <kenneth.loafman@gmail.com> wrote:
>     >>         >>
>     >>         >>     It seems to have started recovery, but it's been 7.5 hours
>     >>         >>     and it appears to be stopped/hung...
>     >>         >>
>     >>         >>     > "1": [
>     >>         >>     >     {
>     >>         >>     >         "gateway_recovery": {
>     >>         >>     >             "index": {
>     >>         >>     >                 "expected_recovered_size": "0b",
>     >>         >>     >                 "expected_recovered_size_in_bytes": 0,
>     >>         >>     >                 "recovered_size": "0b",
>     >>         >>     >                 "recovered_size_in_bytes": 0,
>     >>         >>     >                 "reused_size": "0b",
>     >>         >>     >                 "reused_size_in_bytes": 0,
>     >>         >>     >                 "size": "0b",
>     >>         >>     >                 "size_in_bytes": 0,
>     >>         >>     >                 "throttling_time": "0s",
>     >>         >>     >                 "throttling_time_in_millis": 0
>     >>         >>     >             },
>     >>         >>     >             "stage": "RETRY",
>     >>         >>     >             "start_time_in_millis": 1282226019603,
>     >>         >>     >             "throttling_time": "7.6h",
>     >>         >>     >             "throttling_time_in_millis": 27514627,
>     >>         >>     >             "time": "7.6h",
>     >>         >>     >             "time_in_millis": 27514657,
>     >>         >>     >             "translog": {
>     >>         >>     >                 "recovered": 0
>     >>         >>     >             }
>     >>         >>     >         },
>     >>         >>     >         "index": {
>     >>         >>     >             "size": "0b",
>     >>         >>     >             "size_in_bytes": 0
>     >>         >>     >         },
>     >>         >>     >         "routing": {
>     >>         >>     >             "index": "twitter",
>     >>         >>     >             "node": "031642a1-968f-40fb-b7c2-5a869769d5b4",
>     >>         >>     >             "primary": true,
>     >>         >>     >             "relocating_node": null,
>     >>         >>     >             "shard": 1,
>     >>         >>     >             "state": "INITIALIZING"
>     >>         >>     >         },
>     >>         >>     >         "state": "RECOVERING"
>     >>         >>     >     }
>     >>         >>     > ]
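
The status fragment quoted above shows a shard that has been in `RECOVERING` for 7.6 hours with zero bytes recovered. As a rough sketch only, a check like the following could flag such shards automatically. It assumes the caller has already pulled the per-shard mapping (shard id to list of shard copies) out of the status response; the function name `stalled_shards` and the one-hour threshold are made up for illustration and are not part of Elasticsearch.

```python
def stalled_shards(shards, min_elapsed_ms=60 * 60 * 1000):
    """Return (shard_id, stage, elapsed) for recovering shard copies that
    have run at least min_elapsed_ms without recovering any index bytes.

    `shards` is assumed to map shard id -> list of shard copies, shaped
    like the "1": [ {...} ] fragment quoted in the thread.
    """
    stalled = []
    for shard_id, copies in shards.items():
        for copy in copies:
            recovery = copy.get("gateway_recovery")
            # Only shards that are recovering from the gateway are of interest.
            if copy.get("state") != "RECOVERING" or recovery is None:
                continue
            recovered = recovery["index"]["recovered_size_in_bytes"]
            elapsed = recovery["time_in_millis"]
            if recovered == 0 and elapsed >= min_elapsed_ms:
                stalled.append((shard_id, recovery["stage"], recovery["time"]))
    return stalled


# Trimmed copy of the fragment quoted above (only the fields the check reads).
shards = {
    "1": [{
        "gateway_recovery": {
            "index": {"recovered_size_in_bytes": 0},
            "stage": "RETRY",
            "time": "7.6h",
            "time_in_millis": 27514657,
        },
        "state": "RECOVERING",
    }]
}
print(stalled_shards(shards))
```

The threshold is arbitrary; a large shard recovering from S3 may legitimately take a long time, so the zero-bytes condition is what carries most of the signal here.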
>     >>         >>
>     >>         >>
>     >>         >>     Shay Banon wrote:
>     >>         >>     > It should be allocated on the other node; you shouldn't
>     >>         >>     > need to start another node. When you issue a cluster health
>     >>         >>     > request (a simple curl will do), what is the status? The
>     >>         >>     > cluster state API gives you more information if you are
>     >>         >>     > after it (each shard and its state).
>     >>         >>     >
>     >>         >>     > On Thu, Aug 19, 2010 at 3:48 PM, Kenneth Loafman
>     >>         >>     > <kenneth.loafman@gmail.com> wrote:
>     >>         >>     >
>     >>         >>     >     No, this is the first time.  The shutdown took a while,
>     >>         >>     >     with several 'Waiting for not to shutdown...' style
>     >>         >>     >     messages.  It came up bad after that.
>     >>         >>     >
>     >>         >>     >     So, if I have two nodes now, and one needs to be
>     >>         >>     >     recovered, I'll need 3 nodes to get the recovery done?
>     >>         >>     >
>     >>         >>     >     ...Ken
>     >>         >>     >

(Kenneth Loafman) #19

Looks like a file may be missing on the gateway... this repeats in the
log over and over.

[12:10:00,597][WARN ][indices.cluster ] [Magilla] [twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [twitter][1] Failed to recover translog
at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recoverTranslog(BlobStoreIndexShardGateway.java:516)
at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recover(BlobStoreIndexShardGateway.java:417)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:172)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [twitter][1] Failed to open reader on writer
at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:171)
at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryPrepareForTranslog(InternalIndexShard.java:405)
at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recoverTranslog(BlobStoreIndexShardGateway.java:440)
... 5 more
Caused by: java.io.FileNotFoundException: /mnt/search-data-dev/elasticsearch/nodes/1/indices/twitter/1/index/_d8g.cfs (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:76)
at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:97)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:87)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:67)
at org.elasticsearch.index.store.support.AbstractStore$StoreDirectory.openInput(AbstractStore.java:287)
at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:67)
at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:114)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:590)
at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:616)
at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:574)
at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:150)
at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:36)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:410)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:374)
at org.elasticsearch.index.engine.robin.RobinEngine.buildNrtResource(RobinEngine.java:538)
at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:158)
... 7 more
[12:10:00,605][WARN ][cluster.action.shard ] [Magilla] sending failed shard for [twitter][1], node[10dab323-019b-4036-854f-89bb068dcc8d], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][1] Failed to recover translog]; nested: EngineCreationFailureException[[twitter][1] Failed to open reader on writer]; nested: FileNotFoundException[/mnt/search-data-dev/elasticsearch/nodes/1/indices/twitter/1/index/_d8g.cfs (No such file or directory)]; ]]
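
The "sending failed shard" line above packs the whole exception chain into one bracketed message, with each cause introduced by `nested:`. A minimal sketch for pulling that chain apart, so the root cause (here the missing `_d8g.cfs`) can be grepped out of a repeating log, might look like this. It assumes exactly the `Name[detail]; nested: Name[detail]; ]]` layout seen in this thread; the helper names `exception_chain` and `missing_file` are hypothetical.

```python
import re

# One "SomeException[detail]" element, terminated either by the next
# "; nested: " marker, by the closing "; ]]" of the message, or by end of text.
_ELEMENT = re.compile(r"(\w+Exception)\[(.*?)\](?:; nested: |; \]\]|$)")


def exception_chain(message):
    """Return [(exception_name, detail), ...] from outermost to root cause."""
    return _ELEMENT.findall(message)


def missing_file(chain):
    """Return the path from a trailing FileNotFoundException, else None."""
    if chain and chain[-1][0] == "FileNotFoundException":
        # The detail looks like "<path> (No such file or directory)".
        return chain[-1][1].split(" (", 1)[0]
    return None


message = (
    "IndexShardGatewayRecoveryException[[twitter][1] Failed to recover translog]; "
    "nested: EngineCreationFailureException[[twitter][1] Failed to open reader on writer]; "
    "nested: FileNotFoundException[/mnt/search-data-dev/elasticsearch/nodes/1/indices/"
    "twitter/1/index/_d8g.cfs (No such file or directory)]; ]]"
)

chain = exception_chain(message)
for name, detail in chain:
    print(name, "->", detail)
print("missing:", missing_file(chain))
```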

Shay Banon wrote:

Also, use the latest again, pushed some more fixes.

On Fri, Aug 20, 2010 at 8:04 PM, Shay Banon
<shay.banon@elasticsearch.com> wrote:

Do you see any exceptions in the logs (failing to start the shard)?


On Fri, Aug 20, 2010 at 8:02 PM, Kenneth Loafman
<kenneth.loafman@gmail.com> wrote:

    Now it's looping: progress goes to 100, then starts over.

    I set up a 1/second loop using:
     while /bin/true; do date; curl -XGET 'http://192.168.1.5:9200/twitter/_status?pretty=true'; sleep 1; done
    then copied it to gist at: http://gist.github.com/540711

    It should have recovered by now, I would think.

    ...Ken
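
For anyone who wants to confirm the "goes to 100, then starts over" pattern without eyeballing a gist, here is a rough Python equivalent of the one-second curl loop. The status URL is the one from the post. Which field in the response carries the progress figure is not shown in this thread, so the sketch takes a caller-supplied `extract_progress` function; that name, `poll`, and `count_restarts` are all hypothetical.

```python
import json
import time
import urllib.request

# URL taken from the curl loop in the post above.
STATUS_URL = "http://192.168.1.5:9200/twitter/_status?pretty=true"


def fetch_status(url=STATUS_URL):
    """GET the index status and return it as parsed JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def count_restarts(samples):
    """Count how often a progress series drops, i.e. recovery started over."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def poll(extract_progress, rounds=5, interval=1.0, fetch=fetch_status):
    """Sample progress `rounds` times, sleeping `interval` seconds between.

    `extract_progress` maps one status response to a number; the caller
    supplies it because the field layout is not confirmed here.
    """
    samples = []
    for _ in range(rounds):
        samples.append(extract_progress(fetch()))
        time.sleep(interval)
    return samples
```

A series such as `[10, 60, 100, 5, 50, 100, 0]` gives `count_restarts(...) == 2`, which is the looping behaviour described above.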

    Shay Banon wrote:
    > Great, ping me if it does not end, I am here to help (we can make it
    > more interactive on IRC).
    >
    > p.s. Can you keep the original JSON format when you gist? It makes it
    > much easier to know what's going on. You can add pretty=true as a
    > parameter to get it pretty printed.
    >
    > -shay.banon
    >
    > On Fri, Aug 20, 2010 at 5:51 PM, Kenneth Loafman
    > <kenneth.loafman@gmail.com> wrote:
    >
    >     I think so... Here's the latest on gist: http://gist.github.com/540471
    >
    >     Thanks for the pointer on gist, I've never used it before.
    >
    >     Shay Banon wrote:
    >     > The top just states which shards were queried; a shard that is
    >     > still not allocated will obviously not be allocated. It seems like
    >     > it's still in the recovery process. There are two main APIs to
    >     > really understand what is going on (besides the high-level health
    >     > API): the cluster state API, which shows you what the cluster-wide
    >     > state is (where each shard is supposed to be, what its state is),
    >     > and the status API, which gives you detailed information on the
    >     > status of each shard allocated on each node.
    >     >
    >     > Is the recovery progressing?
    >     >
    >     > p.s. Can you use gist instead of pastebin?
    >     >
    >     > -shay.banon
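
To make the three APIs mentioned above concrete, here is a small sketch that builds the URLs to curl (or paste into a gist request). The `_status` path is the one used elsewhere in this thread; the `_cluster/health` and `_cluster/state` paths are an assumption about this Elasticsearch version, and `diagnostic_urls` is a made-up helper name.

```python
from urllib.parse import quote


def diagnostic_urls(base, index):
    """Return the three URLs worth checking when a shard will not recover.

    Paths for health and state are assumed, not confirmed by the thread;
    pretty=true is added so the JSON keeps its formatting when gisted.
    """
    base = base.rstrip("/")
    return {
        # High-level summary of the cluster.
        "health": f"{base}/_cluster/health?pretty=true",
        # Cluster-wide view: where each shard is supposed to be and its state.
        "state": f"{base}/_cluster/state?pretty=true",
        # Per-shard detail for one index, as used in the polling loop.
        "status": f"{base}/{quote(index, safe='')}/_status?pretty=true",
    }


for name, url in diagnostic_urls("http://192.168.1.5:9200", "twitter").items():
    print(name, url)
```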
    >     >
    >     >     >>         >>     >     > On Thu, Aug 19, 2010 at
    2:28 PM, Kenneth
    >     >     Loafman
    >     >     >>         >>     >     > <kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>
    >     >     >>         >>     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>
    >     >     >>         >>     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>>>
    >     >     >>         >>     >    
    <mailto:kenneth.loafman@gmail.com <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>
    >     >     >>         >>     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>>
    >     >     >>         >>     >    
    <mailto:kenneth.loafman@gmail.com <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>
    >     >     >>         >>     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>>>>> wrote:
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     Hi,
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     The second shard on
    one of my
    >     indexes has
    >     >     >>         failed due to:
    >     >     >>         >>     >     >     [05:59:47,332][WARN
    ][index.gateway
    >     >     >>          ] [Mangog]
    >     >     >>         >>     >     [twitter][1]
    >     >     >>         >>     >     >     failed to snapshot on
    close
    >     >     >>         >>     >     >     ...followed by a long
    traceback.
    >     >     >>         >>     >     >     ...followed by:
    >     >     >>         >>     >     >     [05:59:49,336][WARN
    >     ][cluster.action.shard
    >     >     >>           ] [Mangog]
    >     >     >>         >>     >     received shard
    >     >     >>         >>     >     >     failed for [twitter][1],
    >     >     >>         >>     >
    >     node[86d601df-e124-45ed-a5f2-57d762042d87],
    >     >     >>         >>     >     >     [P], s[INITIALIZING],
    reason [Failed
    >     >     to start
    >     >     >>         shard, message
    >     >     >>         >>     >     >
    >     >     >>        
    [IndexShardGatewayRecoveryException[[twitter][1]
    >     Failed to
    >     >     >>         >>     >     recovery
    >     >     >>         >>     >     >     translog]; nested:
    >     >     >>         >>    
    EngineCreationFailureException[[twitter][1]
    >     >     >>         >>     >     Failed to
    >     >     >>         >>     >     >     open reader on
    writer]; nested:
    >     >     >>         >>     >     >
    >     >     >>         >>     >
    >     >     >>         >>
    >     >     >>
    >     >
    >    
    FileNotFoundException[/mnt/search-data-dev/elasticsearch/nodes/0/indices/twitter/1/index/_d8g.cfs
    >     >     >>         >>     >     >     (No such file or
    directory)]; ]]
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     Is the recovery process
    >     automatic, or do I
    >     >     >>         have to do
    >     >     >>         >>     something
    >     >     >>         >>     >     >     special?  It appears
    to be just this
    >     >     one shard.
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     I use the service
    wrapper to
    >     start/stop
    >     >     >>         0.9.1-SNAPSHOT,
    >     >     >>         >>     and my
    >     >     >>         >>     >     config is
    >     >     >>         >>     >     >     below.
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     ...Thanks,
    >     >     >>         >>     >     >     ...Ken
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     cloud:
    >     >     >>         >>     >     >        aws:
    >     >     >>         >>     >     >            access_key: *****
    >     >     >>         >>     >     >            secret_key: *****
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     gateway:
    >     >     >>         >>     >     >        type: s3
    >     >     >>         >>     >     >        s3:
    >     >     >>         >>     >     >            bucket: *****
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     path :
    >     >     >>         >>     >     >        work :
    /mnt/search-data-dev
    >     >     >>         >>     >     >        logs :
    >     /mnt/search-data-dev/node1/logs
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     index :
    >     >     >>         >>     >     >        number_of_shards : 2
    >     >     >>         >>     >     >        number_of_replicas : 1
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     network :
    >     >     >>         >>     >     >        host : 192.168.1.5
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >
    >     >     >>         >>     >
    >     >     >>         >>     >
    >     >     >>         >>
    >     >     >>         >>
    >     >     >>         >
    >     >     >>
    >     >     >>
    >     >     >>
    >     >     >
    >     >
    >     >
    >
    >

(Shay Banon) #20

It means that the gateway store got corrupted, so you will have to rebuild the
index. This is probably due to all the changes on HEAD. Hopefully it's getting
stable now.

-shay.banon

On Fri, Aug 20, 2010 at 8:13 PM, Kenneth Loafman
<kenneth.loafman@gmail.com> wrote:

Looks like a file may be missing on the gateway... this repeats in the
log over and over.

[12:10:00,597][WARN ][indices.cluster ] [Magilla] [twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [twitter][1] Failed to recover translog
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recoverTranslog(BlobStoreIndexShardGateway.java:516)
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recover(BlobStoreIndexShardGateway.java:417)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:172)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [twitter][1] Failed to open reader on writer
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:171)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryPrepareForTranslog(InternalIndexShard.java:405)
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway.recoverTranslog(BlobStoreIndexShardGateway.java:440)
    ... 5 more
Caused by: java.io.FileNotFoundException: /mnt/search-data-dev/elasticsearch/nodes/1/indices/twitter/1/index/_d8g.cfs (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:76)
    at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:97)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:87)
    at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:67)
    at org.elasticsearch.index.store.support.AbstractStore$StoreDirectory.openInput(AbstractStore.java:287)
    at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:67)
    at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(SegmentReader.java:114)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:590)
    at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:616)
    at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:574)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:150)
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:36)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:410)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:374)
    at org.elasticsearch.index.engine.robin.RobinEngine.buildNrtResource(RobinEngine.java:538)
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:158)
    ... 7 more
[12:10:00,605][WARN ][cluster.action.shard ] [Magilla] sending failed shard for [twitter][1], node[10dab323-019b-4036-854f-89bb068dcc8d], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][1] Failed to recover translog]; nested: EngineCreationFailureException[[twitter][1] Failed to open reader on writer]; nested: FileNotFoundException[/mnt/search-data-dev/elasticsearch/nodes/1/indices/twitter/1/index/_d8g.cfs (No such file or directory)]; ]]
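
Since the same failure repeats in the log over and over, it can help to confirm that every repetition points at the same missing file. The following is a minimal Python sketch of that check, not anything shipped with Elasticsearch. The function name `missing_files` and the abbreviated sample text are assumptions for illustration; only the `FileNotFoundException[... (No such file or directory)]` pattern is taken from the log excerpt above.

```python
import re
from collections import Counter

# Matches the nested summary that appears in the "failed shard" log line, e.g.
#   FileNotFoundException[/path/to/_d8g.cfs (No such file or directory)]
# The path and the message may be separated by a space or a line break.
MISSING_FILE = re.compile(
    r"FileNotFoundException\[\s*(?P<path>\S+)\s+\(No such file or directory\)\]"
)


def missing_files(log_text: str) -> Counter:
    """Count how often each missing file is reported in the given log text."""
    return Counter(m.group("path") for m in MISSING_FILE.finditer(log_text))


if __name__ == "__main__":
    # Abbreviated sample modelled on the excerpt above (hypothetical input).
    sample = (
        "[12:10:00,605][WARN ][cluster.action.shard ] [Magilla] sending\n"
        "failed shard for [twitter][1], reason [Failed to start shard, message\n"
        "[IndexShardGatewayRecoveryException[[twitter][1] Failed to recover\n"
        "translog]; nested: FileNotFoundException[/mnt/search-data-dev/"
        "elasticsearch/nodes/1/indices/twitter/1/index/_d8g.cfs\n"
        "(No such file or directory)]; ]]\n"
    )
    for path, count in missing_files(sample).items():
        print(count, path)
```

If the counter holds a single path, as it would for the excerpt here, the repeated failures all trace back to one missing segment file rather than to several unrelated problems.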

Shay Banon wrote:

Also, use the latest again; I pushed some more fixes.

On Fri, Aug 20, 2010 at 8:04 PM, Shay Banon
<shay.banon@elasticsearch.com> wrote:

Do you see any exceptions in the logs (failing to start the shard)?


On Fri, Aug 20, 2010 at 8:02 PM, Kenneth Loafman
<kenneth.loafman@gmail.com> wrote:

    Now it's looping: progress goes to 100, then starts over.

    I set up a 1/second loop using:
     while /bin/true; do date; curl -XGET 'http://192.168.1.5:9200/twitter/_status?pretty=true'; sleep 1; done

    then copied it to gist at: http://gist.github.com/540711

    It should have recovered by now, I would think.

    ...Ken
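
The "goes to 100, then starts over" pattern described above can also be detected from the polled values instead of by eye. This is a rough Python sketch under stated assumptions: the progress numbers are assumed to have been collected already from the once-per-second polling loop, and the names `count_restarts`, `looks_stuck_in_loop` and `max_restarts` are hypothetical. Extracting the numbers from the `_status` response is deliberately left out, because the layout of that response is not shown here.

```python
from typing import Iterable, Optional


def count_restarts(progress_samples: Iterable[int]) -> int:
    """Return how many times the reported progress fell back to a lower value."""
    restarts = 0
    previous: Optional[int] = None
    for value in progress_samples:
        # A drop (for example 100 -> 5) means the recovery started over.
        if previous is not None and value < previous:
            restarts += 1
        previous = value
    return restarts


def looks_stuck_in_loop(progress_samples: Iterable[int], max_restarts: int = 2) -> bool:
    """Heuristic: more than `max_restarts` fallbacks suggests a retry loop."""
    return count_restarts(progress_samples) > max_restarts


if __name__ == "__main__":
    # Hypothetical samples: two full passes and the start of a third.
    samples = [10, 60, 100, 5, 70, 100, 0, 40]
    print(count_restarts(samples))
    print(looks_stuck_in_loop(samples))
```

The threshold is arbitrary. A single restart can be a legitimate retry, so the sketch only flags repeated fallbacks.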

    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>
    >     >     >>         >>     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>>
    >     >     >>         >>     >
    <mailto:kenneth.loafman@gmail.com <mailto:

kenneth.loafman@gmail.com>

    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>
    >     >     >>         >>     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>
    >     >     >>         <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>
    >     >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>
    >     <mailto:kenneth.loafman@gmail.com
    <mailto:kenneth.loafman@gmail.com>>>>>>>> wrote:
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     Hi,
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     The second shard on
    one of my
    >     indexes has
    >     >     >>         failed due to:
    >     >     >>         >>     >     >     [05:59:47,332][WARN
    ][index.gateway
    >     >     >>          ] [Mangog]
    >     >     >>         >>     >     [twitter][1]
    >     >     >>         >>     >     >     failed to snapshot on
    close
    >     >     >>         >>     >     >     ...followed by a long
    traceback.
    >     >     >>         >>     >     >     ...followed by:
    >     >     >>         >>     >     >     [05:59:49,336][WARN
    >     ][cluster.action.shard
    >     >     >>           ] [Mangog]
    >     >     >>         >>     >     received shard
    >     >     >>         >>     >     >     failed for

[twitter][1],

    >     >     >>         >>     >
    >     node[86d601df-e124-45ed-a5f2-57d762042d87],
    >     >     >>         >>     >     >     [P], s[INITIALIZING],
    reason [Failed
    >     >     to start
    >     >     >>         shard, message
    >     >     >>         >>     >     >
    >     >     >>
    [IndexShardGatewayRecoveryException[[twitter][1]
    >     Failed to
    >     >     >>         >>     >     recovery
    >     >     >>         >>     >     >     translog]; nested:
    >     >     >>         >>
    EngineCreationFailureException[[twitter][1]
    >     >     >>         >>     >     Failed to
    >     >     >>         >>     >     >     open reader on
    writer]; nested:
    >     >     >>         >>     >     >
    >     >     >>         >>     >
    >     >     >>         >>
    >     >     >>
    >     >
    >

FileNotFoundException[/mnt/search-data-dev/elasticsearch/nodes/0/indices/twitter/1/index/_d8g.cfs

    >     >     >>         >>     >     >     (No such file or
    directory)]; ]]
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     Is the recovery process
    >     automatic, or do I
    >     >     >>         have to do
    >     >     >>         >>     something
    >     >     >>         >>     >     >     special?  It appears
    to be just this
    >     >     one shard.
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     I use the service
    wrapper to
    >     start/stop
    >     >     >>         0.9.1-SNAPSHOT,
    >     >     >>         >>     and my
    >     >     >>         >>     >     config is
    >     >     >>         >>     >     >     below.
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     ...Thanks,
    >     >     >>         >>     >     >     ...Ken
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     cloud:
    >     >     >>         >>     >     >        aws:
    >     >     >>         >>     >     >            access_key:

    >     >     >>         >>     >     >            secret_key:

    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     gateway:
    >     >     >>         >>     >     >        type: s3
    >     >     >>         >>     >     >        s3:
    >     >     >>         >>     >     >            bucket: *****
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     path :
    >     >     >>         >>     >     >        work :
    /mnt/search-data-dev
    >     >     >>         >>     >     >        logs :
    >     /mnt/search-data-dev/node1/logs
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     index :
    >     >     >>         >>     >     >        number_of_shards : 2
    >     >     >>         >>     >     >        number_of_replicas :

1

    >     >     >>         >>     >     >
    >     >     >>         >>     >     >     network :
    >     >     >>         >>     >     >        host : 192.168.1.5
    >     >     >>         >>     >     >
    >     >     >>         >>     >     >
    >     >     >>         >>     >
    >     >     >>         >>     >
    >     >     >>         >>
    >     >     >>         >>
    >     >     >>         >
    >     >     >>
    >     >     >>
    >     >     >>
    >     >     >
    >     >
    >     >
    >
    >