Failed to retrieve translog exception


(ppearcy) #1

Hello,
I've been semi-frequently getting the error below after stopping and
starting my single-machine cluster. I started paying attention to doc
counts and saw that I lost ~250 out of ~300,000 docs after the restart,
which, I guess, were the ones in the corrupted translog.
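
One way to make this kind of loss easy to spot is to record per-index document counts before shutdown and compare them after restart. The sketch below assumes the counts have already been collected into dictionaries (for example from the count API); `diff_counts` is a hypothetical helper, and the numbers are invented for illustration.

```python
def diff_counts(before: dict, after: dict) -> dict:
    """Map each index name to (before, after) where the document count changed."""
    changed = {}
    for index in sorted(set(before) | set(after)):
        b, a = before.get(index, 0), after.get(index, 0)
        if b != a:
            changed[index] = (b, a)
    return changed

# Made-up numbers for illustration only; not taken from the cluster described here.
before = {"index01": 6100, "index02": 6150}
after = {"index01": 5850, "index02": 6150}
for index, (b, a) in diff_counts(before, after).items():
    print(f"{index}: {b} -> {a} ({b - a} missing)")
```

Only indexes whose counts differ are reported, so an empty result means nothing was lost across the restart.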

At the time of the shutdown, I was not actively indexing or searching.
My current usage pattern is pretty much constant content at ~20 docs
per minute entering the system and various boolean queries I'm testing
with from a single thread.

On startup, the cluster stayed in red for ~15 minutes, probably while
trying to restore the translog. My gateway is the local fs.

I'm running a config with 49 indexes w/ 5 shards each (245 shards) on
a single machine, which may be pushing something over the edge. We're
evaluating a single machine before moving to the distributed aspect
and were hoping not to have to rebuild to add extra shards.

Any ideas what could be the cause? If there is any further information
I could provide, I'd be happy to.

I'll enable debug and see if I can capture anything there. Memory and
CPU utilization at the time seem fine.

Thanks

[23:35:49,980][WARN ][index.gateway.fs ] [Loss] [index01][4] failed to retrieve translog after [1608] operations, ignoring the rest, considered corrupted
java.io.EOFException
    at org.elasticsearch.common.io.stream.BytesStreamInput.readByte(BytesStreamInput.java:78)
    at org.elasticsearch.common.io.stream.StreamInput.readVInt(StreamInput.java:73)
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway$3.onPartial(BlobStoreIndexShardGateway.java:416)
    at org.elasticsearch.common.blobstore.fs.AbstractFsBlobContainer$1.run(AbstractFsBlobContainer.java:82)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
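
For readers unfamiliar with this failure mode: the trace shows an EOFException raised inside readVInt, which is what one would expect if the log ends partway through a record. The sketch below is not Elasticsearch's translog format. It is a hypothetical length-prefixed log (the names `encode_varint`, `read_varint`, and `read_records` are invented for illustration) showing how a reader can replay every complete record and give up at a truncated tail.

```python
import io

def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, high bit = 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def read_varint(stream) -> int:
    """Decode one varint; raise EOFError if the stream ends mid-value."""
    result = shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended while reading a varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7

def read_records(data: bytes):
    """Return (records, truncated): all complete records, plus a flag if the tail was cut off."""
    stream = io.BytesIO(data)
    records = []
    while stream.tell() < len(data):
        try:
            size = read_varint(stream)
            payload = stream.read(size)
            if len(payload) < size:
                raise EOFError("record payload is shorter than its declared size")
        except EOFError:
            return records, True
        records.append(payload)
    return records, False

ops = [b"index doc 1", b"index doc 2", b"delete doc 1"]
log = b"".join(encode_varint(len(op)) + op for op in ops)
records, truncated = read_records(log[:-4])  # simulate a log cut off mid-record
print(len(records), truncated)
```

In this toy format, everything before the damaged record is still usable, which mirrors the "after [1608] operations, ignoring the rest" wording of the warning.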


(Shay Banon) #2

You probably got it right, but can you post your config? How many nodes do
you start on the same machine? Do you use memory-based indices? I ask
because recovery should be quick if you use fs-based index storage, since
it should be reused.

Have you seen any exceptions trying to write the translog?

-shay.banon

On Sat, Aug 7, 2010 at 4:35 AM, Paul ppearcy@gmail.com wrote:



(ppearcy) #3

Oddly, now that I have set the log level to DEBUG, I haven't been able
to reproduce over the past couple of days. Will revert back to the
default and see if that gets it to kick back in.

Just running a single node, all fs based indexes. No exceptions
writing the translog.

Let me get a good capture of this occurring and I will post further
details including my config.

Thanks,
Paul

On Aug 8, 3:25 pm, Shay Banon shay.ba...@elasticsearch.com wrote:



(ppearcy) #4

Attachments: elasticsearch.log, elasticsearch.yml

I've attached a log capturing the issue (this log starts at a fresh creation and the error occurred on my first restart of the cluster). Also, to reiterate, I am probably using far more shards (5*49) than make sense in my current single-node config.

I was only able to capture it after reducing logging from DEBUG to INFO. I'm not sure if that is pertinent or a red herring, but it could indicate a timing issue.

As a side note, it seems that my current cluster startup time depends on the number of transactions in the translog. When there are many, startup is delayed while they are applied, and it can take 15+ minutes to start back up.

I will be happy to provide any other details needed. I was able to copy off one of the corrupted trans logs, but it is quite large.

If there is anything else I can provide, please let me know.

Thanks!!!


(ppearcy) #5

I posted this via Nabble, as well, and it hasn't gone through, so if
this shows up twice, my apologies...

http://elasticsearch-users.115913.n3.nabble.com/file/n1062891/elasticsearch.log
http://elasticsearch-users.115913.n3.nabble.com/file/n1062891/elasticsearch.yml

I've attached a log capturing the issue (this log starts at a fresh
creation and the error occurred on my first restart of the cluster).
Also, to reiterate, I am probably using far more shards (5*49) than
make sense in my current single-node config.

I was only able to capture it after reducing logging from DEBUG to
INFO. I'm not sure if that is pertinent or a red herring, but it could
indicate a timing issue.

As a side note, it seems that my current cluster startup time depends
on the number of transactions in the translog. When there are many,
startup is delayed while they are applied, and it can take 15+ minutes
to start back up.

I will be happy to provide any other details needed. I was able to
copy off one of the corrupted trans logs, but it is quite large.

If there is anything else I can provide, please let me know.

Thanks!!!

On Aug 8, 4:06 pm, Paul ppea...@gmail.com wrote:



(ppearcy) #6

FYI, I seem to be able to reproduce this on most restarts after
indexing a decent chunk of content, as I've hit this a couple more
times since my last posting. I am not able to reproduce it when I turn
the log level up, though (Woot! workaround :-).

My hunch is that the shutdown sequence isn't allowing enough time for
the transaction logs to be written out in rare cases, probably due to
my unnatural shard count.

Looking through the code, I'd expect an
IndexShardGatewaySnapshotFailedException to be thrown for write
failures to the gateway translog.
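
The hunch above is about buffered writes not reaching disk before the process exits. As a generic illustration only (this is not Elasticsearch code, and `append_durably` is a hypothetical helper), an append can be treated as durable only once the user-space buffer has been flushed and the OS has been asked to sync the file:

```python
import os
import tempfile

def append_durably(path: str, record: bytes) -> None:
    """Append one record and return only after it has been handed to stable storage."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()              # push Python's user-space buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write its cache to the device

path = os.path.join(tempfile.mkdtemp(), "example.log")
append_durably(path, b"op-1\n")
append_durably(path, b"op-2\n")
with open(path, "rb") as f:
    print(f.read())
```

A writer that skips the sync step can lose its most recent records if the process is killed at the wrong moment, and no write exception would be raised in that case.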

Many thanks.

On Aug 9, 4:03 pm, Paul ppea...@gmail.com wrote:



(Shay Banon) #7

Hi,

How do you shut down Elasticsearch?

-shay.banon

On Tue, Aug 10, 2010 at 8:14 AM, Paul ppearcy@gmail.com wrote:


(ppearcy) #8

I am currently running interactively in a console with the -f option
and using Ctrl-C to bring things down.

I'll try out the service wrapper and see if the same issue occurs there.

Thanks

On Aug 10, 12:11 am, Shay Banon shay.ba...@elasticsearch.com wrote:


(Shay Banon) #9

I can't find the answer, but do you index while you shut down the system? If
so, the latest batch of translog operations might not have been fsync'ed to
disk yet. You can use the shutdown API to shut down the cluster, which is
the preferred way to shut down a whole cluster (even one of size 1).
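
The thread does not spell out the call. As a hedged sketch: early Elasticsearch releases documented a nodes shutdown REST endpoint at `/_cluster/nodes/_shutdown`, and later releases removed it, so treat the path, host, and port below as assumptions to check against the documentation for the version in use. `build_shutdown_request` is a hypothetical helper that only builds the request and does not send it.

```python
import urllib.request

def build_shutdown_request(host: str = "localhost", port: int = 9200) -> urllib.request.Request:
    """Build (but do not send) a POST to the assumed cluster-wide shutdown endpoint."""
    # Assumption: this endpoint path matches the Elasticsearch version in use.
    url = f"http://{host}:{port}/_cluster/nodes/_shutdown"
    return urllib.request.Request(url, data=b"", method="POST")

req = build_shutdown_request()
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Keeping construction separate from sending makes it possible to inspect the target URL before anything is shut down.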

-shay.banon

On Tue, Aug 10, 2010 at 6:50 PM, Paul ppearcy@gmail.com wrote:


(ppearcy) #10

In my testing, I have been disabling the process that submits docs
before shutdown and then waiting at least 20 seconds.

I will try out the cluster shutdown API, since that seems like the
proper procedure. Will let you know the outcome.

Thank you.


(ppearcy) #11

Unless I am missing something, I can't get the REST shutdown request
to work. Using a 0.9.1 snapshot from last week.

http://localhost:9200/_cluster/nodes/_shutdown
returns this json:
{
  "cluster_name": "elasticsearch",
  "nodes": { }
}

This command:
http://localhost:9200/_cluster/nodes/_all/_shutdown

returns:
No handler found for uri [/_cluster/nodes/_all/_shutdown] and method [GET]

Thanks


(Shay Banon) #12

I think you are using GET on the URL below; you should use POST instead
(curl -XPOST localhost:9200/_cluster/nodes/_shutdown)

-shay.banon
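
For anyone scripting this rather than using curl, here is a minimal Python sketch of the same POST request. It assumes the endpoint exactly as given in this thread (a 0.9.x-era node on localhost:9200); the helper name `build_shutdown_request` is made up for illustration, and the request is only built here, not sent.

```python
import urllib.request


def build_shutdown_request(host="localhost", port=9200):
    """Build (but do not send) a POST request for the shutdown endpoint.

    The URL path is the one quoted in this thread; the host and port
    defaults are assumptions for a local single-node setup.
    """
    url = f"http://{host}:{port}/_cluster/nodes/_shutdown"
    # An explicit method="POST" avoids the GET default that a browser
    # (or a bare urlopen(url)) would use.
    return urllib.request.Request(url, data=b"", method="POST")


req = build_shutdown_request()
print(req.get_method(), req.full_url)

# To actually send it against a running node:
# urllib.request.urlopen(req)
```

Pasting the URL into a browser always issues a GET, which would explain the responses quoted above.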


(ppearcy) #13

Thank you. I have not been able to reproduce the issue when shutting
down through the proper cluster shutdown API or the service wrapper
(although I am guessing the service wrapper probably goes through a
very similar, if not identical, path to ctrl-c).

I'll keep you updated, but I feel things are looking good.
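
As a toy illustration of the failure mode discussed in this thread (a log whose tail was not fully written before the process went down), the Python sketch below writes a few length-prefixed records, cuts off the last bytes, and reads back whatever is complete. The record format and the names `write_ops` and `read_ops` are hypothetical; this is not Elasticsearch's actual translog format or recovery code.

```python
import io
import struct


def write_ops(ops):
    """Serialize operations as records: 4-byte big-endian length + payload."""
    buf = io.BytesIO()
    for op in ops:
        payload = op.encode("utf-8")
        buf.write(struct.pack(">I", len(payload)))
        buf.write(payload)
    return buf.getvalue()


def read_ops(data):
    """Return (ops, truncated); stop at the first incomplete record."""
    ops = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            return ops, True  # length prefix itself is cut off
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if pos + length > len(data):
            return ops, True  # payload is cut off
        ops.append(data[pos:pos + length].decode("utf-8"))
        pos += length
    return ops, False


full = write_ops(["index doc1", "index doc2", "index doc3"])
cut = full[:-3]  # simulate a tail that never reached disk
recovered, truncated = read_ops(cut)
print(len(recovered), truncated)
```

The reader keeps the operations before the damaged record and drops the rest, which is roughly what the "after [1608] operations, ignoring the rest" warning in the original report describes.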


(ppearcy) #14

Yeah, I don't believe that the issue exists with the service wrapper
or the cluster shutdown processes, which is great.

I will keep a lookout going forward, but don't consider this an issue.

Thanks,
Paul


(system) #15