Segment errors

Hi,

I am having an issue where segments in my index keep getting errors (either
out-of-bounds or merge errors), and documents get deleted when I restart the
server with 'index.shard.check_on_startup: fix'. My system updates the
documents quite frequently; could that be a cause of this issue? The last
time a segment was deleted, it removed 1.9 million documents (about half of
my data) from a single segment.

Thank you for any insight you could give.
Stefanie


Do you have a broken shard, and is that why you use "fix"? It is a dangerous
option because it can modify or delete data. I'd strongly recommend not
using it without good reason. To check for a broken shard, it is enough to
use "true".

Jörg


I know that the shard is broken, but I do not know how to fix it without
using "fix", which ends up deleting the broken segment. I am currently in
the process of migrating from a 1-node/2-shard setup to a 2-node/4-shard
setup, in the hope that having replicas will help with the issue of data
loss. I want to figure out what might be causing the shard to crash and
whether there is anything I can do to prevent the crashes. We are inserting
and updating data almost all the time; could this be a cause?

Stefanie


What ES version do you use? Have you checked the node for large enough file
descriptor settings, RAM, and disk space? If they are low, the OS might get
into resource congestion when indexing to ES, and it can happen that not
everything is written to disk. These kinds of problems are really very, very
rare, almost theoretical, because the translog catches most of them, and
most of the time OOMs occur instead. Data loss from broken shards is very
critical. So if you could provide more info (maybe reproducible steps) about
how shards are destroyed by your operations, that would be very useful. Did
you upgrade from a previous ES version, by any chance?
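
For example, on the node something like this would show the limits I mean
(standard Linux tools; the PID and data path are placeholders you would
adjust):

ps -ef | grep elasticsearc[h]           # find the PID of the node
grep 'open files' /proc/<PID>/limits    # file descriptor limit (32k+ is commonly recommended)
free -m                                 # RAM and swap usage
df -h /var/data/elasticsearch           # free disk space on the data path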

Jörg


I am currently using ES 0.90.3, and we did not upgrade from an older version
(we are upgrading to 0.90.5 for the new servers and will be starting with a
clean index). The machine has 16GB of RAM, 8 of which are given to the heap,
a 256GB SSD for the data, and a 750GB SATA drive that Elasticsearch runs off
of. I am using Java version 1.6.0_27 (upgrading to 1.7.* for the new
servers). We got to about 8GB of data and the heap seemed to be doing well
until there was an "ArrayIndexOutOfBoundsException"; I am not sure what
caused it. The data loss is an issue, but the bigger issue is that I have to
reboot the server, which affects my site.

Stefanie


Do you have more info about the exceptions you see? Stack traces? Maybe it
is possible to help.

It is very scary that you reboot the server...

Jörg


When I run the check, these are the results that I get; I do not have a
stack trace from the error:

1 of 24: name=_45ma docCount=360525
    codec=Lucene42
    compound=false
    numFiles=13
    size (MB)=561.713
    diagnostics = {timestamp=1381804206548, os=Linux, os.version=2.6.32-45-server, mergeFactor=10, source=merge, lucene.version=4.4.0 1504776 - sarowe - 2013-07-19 02:53:42, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.6.0_27, java.vendor=Sun Microsystems Inc.}
    has deletions [delGen=105]
    test: open reader.........OK [23062 deleted docs]
    test: fields..............OK [45 fields]
    test: field norms.........OK [13 fields]
    test: terms, freq, prox...ERROR: java.lang.ArrayIndexOutOfBoundsException: 4913738
java.lang.ArrayIndexOutOfBoundsException: 4913738
    at org.apache.lucene.codecs.lucene40.BitVector.get(BitVector.java:146)
    at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.nextDoc(Lucene41PostingsReader.java:1342)
    at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:829)
    at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1216)
    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:607)
    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:361)
    at org.elasticsearch.index.shard.service.InternalIndexShard.checkIndex(InternalIndexShard.java:847)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryPrepareForTranslog(InternalIndexShard.java:568)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:200)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)

I am not sure what a better way to restart Elasticsearch and fix the shard
would be. Is there something I can run that would let me fix the shard
without rebooting the server?

Thank you,
Stefanie


I think this is the wrong index checker: codec=Lucene42, a Lucene41
postings reader, and a Lucene40 BitVector. This looks like it is not going
to work out right. The "ArrayIndexOutOfBoundsException" means in most cases
an index version mismatch.

No, there is no way to repair a shard; "fix" means it will drop broken
segments, as you have already observed. As the "fix" operation is harmful,
you should not use it.
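
If you want to inspect a shard offline, the same checker ES uses can be run
by hand against a stopped node (a sketch; the jar location and the shard
path are assumptions, and the -fix option is exactly as destructive as the
ES setting):

# read-only check of one shard's Lucene index (stop the node first)
java -cp /path/to/lucene-core-4.4.0.jar org.apache.lucene.index.CheckIndex \
    /var/data/elasticsearch/nodes/0/indices/myindex/0/index
# adding -fix would drop the broken segments, just like ES does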

Please note, ES provides replicas for backing up shard data, mainly for
index recovery. So you are on the right path to add more nodes and a replica
level, as this is recommended practice. In case of a broken replica, ES will
download the copy from another node to replace the broken one.
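
Once the second node has joined, a replica can be added to an existing index
via the update settings API, for example (a sketch; "myindex" is a
placeholder):

curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index" : { "number_of_replicas" : 1 }
}'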

But the most important part is to find the reason for the broken shard. Do
you always run "fix", by any chance?

If not, it is by no means usual behaviour for ES to produce broken shards
just by indexing. Please post more info in order to find out the cause, if
this is the case.

Not sure what you mean by rebooting, but rebooting the server is never
required to fix a Lucene index issue.
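
Restarting only the Elasticsearch process is enough, for example (a sketch,
assuming the service wrapper from the DEB/RPM packages; a tarball install
would be stopped and started by hand):

sudo service elasticsearch restart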

Jörg


I am surprised that Jörg has not repeated his usual mantra of "upgrade to
Java 7". :-) It is worth a try; however, I do not think it will solve an
already corrupt segment.

Are these errors on the 1-node/2-shard cluster, or have you finally moved to
the 2-node/4-shard setup you mentioned? 2 shards with 8GB total is not a
large amount of data, but more replicas/shards should help.
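
If you do recreate the index on the new cluster, the shard and replica
counts can be set at creation time, e.g. (a sketch; the index name is a
placeholder):

curl -XPUT 'http://localhost:9200/myindex' -d '{
  "settings" : { "number_of_shards" : 4, "number_of_replicas" : 1 }
}'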

--
Ivan


I am still running on the one-server configuration and will hopefully be
moving over to the two-server setup tonight; the new setup is using Java 7.
I have dropped the corrupt segments; luckily my data is temporary, so the
loss is not a big deal. I am just trying to figure out how to prevent this
from happening in the future.

Right now I am always running with "fix", but once I move to the new servers
I will not be. If this issue happens again, I will post the error results.
Hopefully, if it is an index version mismatch, the new servers will solve
the issue, as we are not transferring any of the data.

Thank you for all the help. I will post as soon as I know whether the new
setup is having the same issues.

Stefanie


I got the message that Java 7 is already planned, so nothing to add here...

Java 7 is not the best tip in this special case because, to my knowledge,
Lucene index check success does not depend on the JVM :-)

With Java 6, memory management is quite aged: no good GC for multicore, no
good scaling. OpenJDK 6 is a backport of an early, alpha-quality Java 7 core
made to pass the compatibility suite, but many things are broken. That does
not mean that Oracle/Sun Java 6 will crash ES or the Lucene index.
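
To double-check which JVM a node is actually running on (a sketch with
standard tools; paths vary by install):

ps -ef | grep elasticsearc[h]   # shows the full java command line of the node
java -version                   # the default JVM on this machine's PATH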

Jörg


The new system has been up for about 10 days now and I haven't had any
issues yet. I want to thank everyone for their help with this issue.

Stefanie
