Failed to retrieve translog exception

ppearcy · August 7, 2010, 1:35am

Hello,
I've been semi-frequently getting the error below after stopping and
starting my single machine cluster. I started paying attention to doc
counts and saw that I lost ~250 out of ~300,000 after the restart
which, I guess, were the ones in the corrupted trans log.

At the time of the shutdown, I was not actively indexing or searching.
My current usage pattern is pretty much constant content at ~20 docs
per minute entering the system and various boolean queries I'm testing
with from a single thread.

On startup, the cluster stayed in red for ~15 minutes, probably while
trying to restore the trans log. My gateway is the local fs.

I'm running a config with 49 indexes w/ 5 shards each on a single
machine, which may be pushing something over the edge. We're
evaluating a single machine before moving to the distributed aspect
and were hoping not to have to rebuild to add extra shards.

Any ideas what could be the cause? If there is any further information
I could provide, I'd be happy to.

I'll enable debug and see if I can capture anything there. Memory and
CPU utilization at the time seem fine.

Thanks

[23:35:49,980][WARN ][index.gateway.fs ] [Loss] [index01][4] failed to retrieve translog after [1608] operations, ignoring the rest, considered corrupted
java.io.EOFException
    at org.elasticsearch.common.io.stream.BytesStreamInput.readByte(BytesStreamInput.java:78)
    at org.elasticsearch.common.io.stream.StreamInput.readVInt(StreamInput.java:73)
    at org.elasticsearch.index.gateway.blobstore.BlobStoreIndexShardGateway$3.onPartial(BlobStoreIndexShardGateway.java:416)
    at org.elasticsearch.common.blobstore.fs.AbstractFsBlobContainer$1.run(AbstractFsBlobContainer.java:82)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
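
For readers trying to picture what the warning above means: the EOFException is raised while reading a length value, which is what one would expect if the log file ends part-way through a record. The sketch below is only an illustration of that general pattern, using a made-up record format (a 4-byte length followed by the payload). The class name, method names, and format are all assumptions, and this is not the actual Elasticsearch translog code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical length-prefixed log, not the real translog format.
class TruncatedLogSketch {

    // Serialize each operation as a 4-byte length followed by the payload bytes.
    static byte[] writeLog(List<byte[]> operations) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (byte[] op : operations) {
            out.writeInt(op.length);
            out.write(op);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Replay records until the data ends. A record that was cut off part-way
    // raises EOFException; records read before that point are kept and the
    // remainder is ignored.
    static List<byte[]> replay(byte[] log) throws IOException {
        List<byte[]> recovered = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(log));
        try {
            while (in.available() > 0) {
                int length = in.readInt();
                byte[] op = new byte[length];
                in.readFully(op);
                recovered.add(op);
            }
        } catch (EOFException e) {
            System.out.println("failed to read log after [" + recovered.size()
                    + "] operations, ignoring the rest");
        }
        return recovered;
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> ops = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            ops.add(("doc-" + i).getBytes(StandardCharsets.UTF_8));
        }
        byte[] full = writeLog(ops);
        // Simulate a log whose last record was only partially written.
        byte[] truncated = Arrays.copyOf(full, full.length - 3);
        System.out.println("full log: " + replay(full).size() + " operations");
        System.out.println("truncated log: " + replay(truncated).size() + " operations");
    }
}
```

In this toy format, only the operations that were completely written before the cut-off are recovered, which would match a small number of documents going missing after a restart.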

kimchy · August 8, 2010, 9:25pm

You probably got it right, but can you post your config? How many nodes do
you start on the same machine? Do you use memory-based indices? Recovery
should be quick if you use fs-based index storage, since it should be
reused.

Have you seen any exceptions trying to write the translog?

-shay.banon

ppearcy · August 8, 2010, 10:06pm

Oddly, now that I have set the log level to DEBUG, I haven't been able
to reproduce over the past couple of days. Will revert back to the
default and see if that gets it to kick back in.

Just running a single node, all fs based indexes. No exceptions
writing the translog.

Let me get a good capture of this occurring and I will post further
details including my config.

Thanks,
Paul

ppearcy · August 9, 2010, 9:28pm

Attachments: elasticsearch.log, elasticsearch.yml

I've attached a log capturing the issue (this log starts at a fresh creation and the error occurred on my first restart of the cluster). Also, to reiterate, I am probably using way more shards (5*49) than makes sense in my current single node config.

I was only able to capture it after reducing logging from DEBUG to INFO. I'm not sure if that is pertinent or a red herring, but it could indicate a timing issue.

As a side note, it seems that my current cluster startup time depends on the number of transactions in the trans log. When there are many, startup is delayed while these transactions are applied. It can take 15+ minutes to start back up.

I will be happy to provide any other details needed. I was able to copy off one of the corrupted trans logs, but it is quite large.

If there is anything else I can provide, please let me know.

Thanks!!!

ppearcy · August 9, 2010, 10:03pm

I posted this via Nabble, as well, and it hasn't gone through, so if
this shows up twice, my apologies...

http://elasticsearch-users.115913.n3.nabble.com/file/n1062891/elasticsearch.log
http://elasticsearch-users.115913.n3.nabble.com/file/n1062891/elasticsearch.yml

ppearcy · August 10, 2010, 5:14am

FYI, I seem to be able to reproduce this after indexing a decent chunk
of content on most restarts, as I've hit this a couple more times
since my last posting. I am not able to reproduce it when I turn the log
level up, though (Woot! workaround :-).

My hunch is that the shutdown sequence isn't allowing enough time for
the transaction logs to be written out in rare cases, probably due to
my unnatural shard count.

Looking through the code, I'd expect an
IndexShardGatewaySnapshotFailedException to be thrown for write
failures to the gateway translog.

Many thanks.

kimchy · August 10, 2010, 6:11am

Hi,

How do you shut down elasticsearch?

-shay.banon

ppearcy · August 10, 2010, 3:50pm

I am currently running interactively in a console with the -f option
and using Ctrl-C to bring things down.

I'll try out the service wrapper and see if the same issue occurs there.

Thanks

kimchy · August 10, 2010, 4:54pm

I can't find the answer, but do you index while you shut down the system? If
so, then the latest batch of translog might not have been fsync'ed to disk
yet. You can use the shutdown API to shut down the cluster, which is
the preferred way to shut down a whole cluster (even one of size 1).

-shay.banon
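
As background on the fsync remark above: a write is only guaranteed to be on the storage device once it has been explicitly forced there. The following is a minimal Java sketch of that general technique using `FileChannel.force`. The class name, method name, and file layout are assumptions made for illustration, and it does not show how Elasticsearch itself writes the translog.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: appends records to a log file and forces each one to disk.
class FsyncSketch {

    // Append one record and ask the OS to flush it to the storage device
    // before returning. Without the force() call, the bytes may still be
    // sitting in OS caches when this method returns.
    static void appendDurably(Path log, byte[] record) throws IOException {
        try (FileChannel channel = FileChannel.open(log,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buffer = ByteBuffer.wrap(record);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // true = also flush file metadata (such as the file length).
            channel.force(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("translog-sketch", ".log");
        appendDurably(log, "op-1\n".getBytes(StandardCharsets.UTF_8));
        appendDurably(log, "op-2\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("bytes on disk: " + Files.size(log));
        Files.delete(log);
    }
}
```

Forcing after every record is the simplest approach but also the slowest. Systems commonly batch several records per force, which is the kind of window in which a "latest batch" could be lost on an abrupt stop.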

ppearcy · August 10, 2010, 5:42pm

In my testing, I have been disabling the process that has been
submitting docs before shutdown and giving it at least 20 seconds.

I will try out the cluster shutdown API, since that seems like the
proper procedure. Will let you know the outcome.

Thank you.
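
For anyone wanting to script the shutdown API mentioned in this thread, here is a hedged Java sketch that prepares the HTTP request. The endpoint path `/_cluster/nodes/_shutdown` and the port 9200 are assumptions based on early (0.x/1.x era) Elasticsearch documentation and are not confirmed anywhere in this thread, so check the documentation for your version. The class and method names are made up for the example.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Illustrative only: prepares (but does not send) a shutdown request.
class ShutdownRequestSketch {

    // Assumed endpoint; verify against the documentation for your release.
    static final String SHUTDOWN_PATH = "/_cluster/nodes/_shutdown";

    // Build a POST connection to the assumed shutdown endpoint. No network
    // traffic happens here; the request is only sent once the caller asks
    // for the response (for example via getResponseCode()).
    static HttpURLConnection prepareShutdown(String host, int port) throws IOException {
        URI uri = URI.create("http://" + host + ":" + port + SHUTDOWN_PATH);
        HttpURLConnection connection = (HttpURLConnection) uri.toURL().openConnection();
        connection.setRequestMethod("POST");
        return connection;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection connection = prepareShutdown("localhost", 9200);
        System.out.println(connection.getRequestMethod() + " " + connection.getURL());
        // To actually send it against a running node:
        // int status = connection.getResponseCode();
    }
}
```

The send step is left commented out so the sketch can be run without a live node.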

ppearcy · August 11, 2010, 6:08am

Unless I am missing something, I can't get the REST shutdown request
to work. Using a 0.9.1 snapshot from last week.

http://localhost:9200/_cluster/nodes/_shutdown
returns this json:
{
  cluster_name: "elasticsearch"
  nodes: { }
}

This command:
http://localhost:9200/_cluster/nodes/_all/_shutdown

returns:
No handler found for uri [/_cluster/nodes/_all/_shutdown] and method
[GET]

Thanks

kimchy · August 11, 2010, 7:57am

I think you are using GET on the URL below; you should use POST
(curl -XPOST localhost:9200/_cluster/nodes/_shutdown)

-shay.banon
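
For anyone scripting this instead of calling curl, here is a minimal Python sketch of the same POST. The host, port and path are the ones quoted in this thread for the 0.9.x line and may not exist in other Elasticsearch versions; the helper names build_shutdown_request and shutdown_cluster are made up for illustration.

```python
import urllib.request

# Path as quoted in this thread; assumed to apply to the 0.9.x snapshot discussed here.
SHUTDOWN_PATH = "/_cluster/nodes/_shutdown"


def build_shutdown_request(base_url="http://localhost:9200"):
    """Build (but do not send) the shutdown request.

    method="POST" is set explicitly; a Request created without data
    and without a method defaults to GET.
    """
    return urllib.request.Request(base_url + SHUTDOWN_PATH, data=b"", method="POST")


def shutdown_cluster(base_url="http://localhost:9200", timeout=10):
    """Send the request and return the response body as text (hypothetical helper)."""
    with urllib.request.urlopen(build_shutdown_request(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Inspect the request without touching the network.
req = build_shutdown_request()
print(req.get_method(), req.full_url)
```

Opening the URL in a browser issues a GET, which would explain the "No handler found ... method [GET]" message reported earlier in the thread.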

ppearcy · August 11, 2010, 4:01pm

Thank you. I have not been able to reproduce the problem when shutting
down through the proper cluster shutdown API or the service wrapper
(although I am guessing the service wrapper goes through a very
similar, if not identical, path to ctrl-c).

I'll keep you updated, but I feel things are looking good.

ppearcy · August 13, 2010, 3:22am

Yeah, I don't believe that the issue exists with the service wrapper
or the cluster shutdown processes, which is great.

I will keep an eye out going forward, but I don't consider this an issue.

Thanks,
Paul
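
To illustrate the failure mode discussed in this thread: if the tail of a log never reaches disk, a reader hits end-of-stream partway through a record and can only keep the operations that were complete. That matches the "failed to retrieve translog after [1608] operations, ignoring the rest" warning in the first post. Below is a minimal Python sketch, assuming a made-up record layout (a variable-length int size followed by the payload). It is not Elasticsearch's actual translog format, and every function name is hypothetical.

```python
import io


def write_vint(buf, value):
    """Append a non-negative int in 7-bit groups, low bits first.

    The high bit of each byte signals that another byte follows.
    """
    while value >= 0x80:
        buf.append((value & 0x7F) | 0x80)
        value >>= 7
    buf.append(value)


def read_vint(stream):
    """Read one variable-length int; raise EOFError if the stream ends early."""
    result, shift = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a variable-length int")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def encode(ops):
    """Serialize operations as size-prefixed records (assumed layout)."""
    buf = bytearray()
    for op in ops:
        write_vint(buf, len(op))
        buf.extend(op)
    return bytes(buf)


def replay(data):
    """Return (complete_operations, truncated).

    Reading stops at the first incomplete record; everything read
    before that point is kept and the rest is ignored.
    """
    stream = io.BytesIO(data)
    ops = []
    while stream.tell() < len(data):
        try:
            size = read_vint(stream)
            payload = stream.read(size)
            if len(payload) < size:
                raise EOFError("stream ended inside a record")
        except EOFError:
            return ops, True
        ops.append(payload)
    return ops, False


log = encode([b"index doc 1", b"index doc 2", b"index doc 3"])
recovered, truncated = replay(log[:-4])  # simulate a tail that never reached disk
print(len(recovered), truncated)
```

One limitation of this layout: a cut that lands exactly on a record boundary looks like a clean end of log, so the lost operations go unnoticed.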
