Corrupted all indices after a failure


(Franz Allan Valencia See) #1

I've been playing with ElasticSearch for about 3 weeks now. So far,
everything has been great. But lately, I started trying to index all the
data in the tables I'm targeting (instead of just a subset, as I had been
doing to evaluate ElasticSearch).

Currently, I store these replicated tables in different indices. That's
because, after those individual replications, I go through all of them
and do the joining in my application to index the resulting object trees.

I am able to index most of my tables; one of them has 500,000+ records. But
one particular table contains about 6.5M rows. What I do is query this
table, get a scrollable ResultSet, iterate over it, and index the rows
one by one via the Java API. The whole process for replicating this
particular table takes about 1.5 hours.
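In case it helps, here is a minimal sketch of that loop. The names are hypothetical; in the real code the Iterator is backed by the scrollable JDBC ResultSet and the Consumer wraps an index call through the ElasticSearch Java API, both stubbed out here. The position counter is what I rely on to resume manually after a failure:

```java
import java.util.Iterator;
import java.util.function.Consumer;

// Sketch of the replicate-one-table loop (names are made up).
// 'startAt' is the manual-resume checkpoint: rows before it are skipped.
public class TableReplicator {
    public static long indexAll(Iterator<String> rows,
                                Consumer<String> indexer,
                                long startAt) {
        long position = 0;
        while (rows.hasNext()) {
            String row = rows.next();
            if (position >= startAt) {
                indexer.accept(row); // may throw; 'position' says where to resume
            }
            position++;
        }
        return position; // rows seen; if a run fails, restart with startAt = last good position
    }
}
```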

However, most of the time I will get an error during indexing (e.g. an
IOException from ElasticSearch, a NoShardException, an
org.apache.hadoop.ipc.RemoteException (because I am using the hdfs gateway),
etc). That's ok with me, as long as I can resume from where I left off;
I'll just recover manually (for now). However, when something goes
wrong during this indexing, all my indices suddenly get corrupted:
either I get something like a NoShardException on most of
them, or the indices disappear entirely.

I even tried issuing a http://localhost:9200/_flush after a successful
replication, but that still didn't solve the problem.

Any ideas where I went wrong?

Thanks,

Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see


(Shay Banon) #2

If it relates to the max open files exception, then your system is not in a
good state. Raise the max open files limit and try again.
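One way to verify the limit from inside the JVM (rather than just `ulimit -n`) is via the Sun-specific UnixOperatingSystemMXBean; a sketch, which only applies on Unix-like platforms with a Sun/HotSpot JVM:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

// Prints the current and maximum file descriptor counts for this JVM.
// Returns the max, or -1 on platforms where the Unix MXBean is unavailable.
public class FdCheck {
    public static long maxOpenFiles() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
                    + " / max: " + unix.getMaxFileDescriptorCount());
            return unix.getMaxFileDescriptorCount();
        }
        return -1; // not a Unix JVM; check ulimit -n instead
    }
}
```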



(Franz Allan Valencia See) #3

I'm no longer getting that 'Too many open files' error, but the problem is
still there. Basically, if something goes wrong while I'm indexing on a
particular index, all my indices get corrupted.

Sometimes, to fix it, I even have to reformat my hdfs node.



(Shay Banon) #4

In general you should not get exceptions while indexing (unless you index
something wrong). What type of exceptions do you get? And remind me, which
version are you on?

-shay.banon



(Franz Allan Valencia See) #5

Things like:

org.elasticsearch.transport.RemoteTransportException:
[Icemaster][inet[/10.0.6.1:9300]][indices/index/shard/index]
Caused by: org.elasticsearch.action.PrimaryNotStartedActionException:
[application][1] Timeout waiting for [1m]
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$4.onTimeout(TransportShardReplicationOperationAction.java:311)
at
org.elasticsearch.cluster.service.InternalClusterService$1$1.run(InternalClusterService.java:87)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

Caused by: org.elasticsearch.action.PrimaryNotStartedActionException:
[application][1] Timeout waiting for [1m]
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$4.onTimeout(TransportShardReplicationOperationAction.java:311)
at
org.elasticsearch.cluster.service.InternalClusterService$1$1.run(InternalClusterService.java:87)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

Or some shard broadcast exception, or something like "cannot update an index
because something is already updating it".

I'll post more stack traces once I get them.

Thanks,



(Franz Allan Valencia See) #6

And I'm on 0.8.0 :)



(Franz Allan Valencia See) #7

I just reproduced the problem again. Here's more information.

Basically, while indexing, I get several of these

[17:15:49,010][WARN ][org.apache.hadoop.hdfs.DFSClient] DataStreamer
Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/my/path/es/elasticsearch/indices/myentity/2/translog/translog-0 could only
be replicated to 0 nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy14.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy14.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

[17:15:49,019][WARN ][org.apache.hadoop.hdfs.DFSClient] Error Recovery for
block null bad datanode[0] nodes == null
[17:15:49,020][WARN ][org.apache.hadoop.hdfs.DFSClient] Could not get block
locations. Source file
"/my/path/es/elasticsearch/indices/myentity/2/translog/translog-0" -
Aborting...
[17:15:49,020][WARN ][index.gateway ] [Powerhouse][myentity][2]
Failed to snapshot (scheduled)
org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException:
[myentity][2] Failed to snapshot translog into [null]
at
org.elasticsearch.index.gateway.hdfs.HdfsIndexShardGateway.snapshot(HdfsIndexShardGateway.java:239)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:179)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:175)
at
org.elasticsearch.index.engine.robin.RobinEngine.snapshot(RobinEngine.java:364)
at
org.elasticsearch.index.shard.service.InternalIndexShard.snapshot(InternalIndexShard.java:377)
at
org.elasticsearch.index.gateway.IndexShardGatewayService.snapshot(IndexShardGatewayService.java:175)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$SnapshotRunnable.run(IndexShardGatewayService.java:257)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/my/path/es/elasticsearch/indices/myentity/2/translog/translog-0 could only
be replicated to 0 nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy14.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy14.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

After which, I try to access my index's _terms (
http://localhost:9200/myentity/_terms) and get:
{
  "_shards": {
    "total": 5,
    "successful": 2,
    "failed": 3,
    "failures": [
      {
        "index": "myentity",
        "shard": 4,
        "reason": "BroadcastShardOperationFailedException[[myentity][4] No active shard(s)]"
      },
      {
        "index": "myentity",
        "shard": 3,
        "reason": "BroadcastShardOperationFailedException[[myentity][3] No active shard(s)]"
      },
      {
        "index": "myentity",
        "shard": 0,
        "reason": "BroadcastShardOperationFailedException[[myentity][0] No active shard(s)]"
      }
    ]
  }
}

After restarting elasticsearch & hadoop (not sure if this is what triggered
it), my other indices suffer from that BroadcastShardOperationFailedException
as well (the shard that gets a BroadcastShardOperationFailedException is
random, and is not limited to the index that failed during indexing).

Any ideas what's happening here?

Thanks,



(Shay Banon) #8

Not sure about the hadoop exception, to be honest. Regarding the other
exceptions from elasticsearch, there is some chance you would get them in
0.8; they have been addressed in the upcoming 0.9.

On Tue, Jul 20, 2010 at 11:44 AM, Franz Allan Valencia See <
franz.see@gmail.com> wrote:

I just reproduced the problem again. Here's more information.

Basically, while indexing, I get several of these

[17:15:49,010][WARN ][org.apache.hadoop.hdfs.DFSClient] DataStreamer

Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/my/path/es/elasticsearch/indices/myentity/2/translog/translog-0 could only
be replicated to 0 nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy14.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy14.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

[17:15:49,019][WARN ][org.apache.hadoop.hdfs.DFSClient] Error Recovery for
block null bad datanode[0] nodes == null
[17:15:49,020][WARN ][org.apache.hadoop.hdfs.DFSClient] Could not get
block locations. Source file
"/my/path/es/elasticsearch/indices/myentity/2/translog/translog-0" -
Aborting...
[17:15:49,020][WARN ][index.gateway ] [Powerhouse][myentity][2]
Failed to snapshot (scheduled)
org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException:
[myentity][2] Failed to snapshot translog into [null]
at
org.elasticsearch.index.gateway.hdfs.HdfsIndexShardGateway.snapshot(HdfsIndexShardGateway.java:239)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:179)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:175)
at
org.elasticsearch.index.engine.robin.RobinEngine.snapshot(RobinEngine.java:364)
at
org.elasticsearch.index.shard.service.InternalIndexShard.snapshot(InternalIndexShard.java:377)
at
org.elasticsearch.index.gateway.IndexShardGatewayService.snapshot(IndexShardGatewayService.java:175)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$SnapshotRunnable.run(IndexShardGatewayService.java:257)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)

at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.ipc.RemoteException: java.io.IOException:
File /my/path/es/elasticsearch/indices/myentity/2/translog/translog-0 could
only be replicated to 0 nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy14.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy14.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

After which, I try to access my index's _terms (
http://localhost:9200/myentity/_terms) and get:

{
  "_shards": {
    "total": 5,
    "successful": 2,
    "failed": 3,
    "failures": [
      {
        "index": "myentity",
        "shard": 4,
        "reason": "BroadcastShardOperationFailedException[[myentity][4] No active shard(s)]"
      },
      {
        "index": "myentity",
        "shard": 3,
        "reason": "BroadcastShardOperationFailedException[[myentity][3] No active shard(s)]"
      },
      {
        "index": "myentity",
        "shard": 0,
        "reason": "BroadcastShardOperationFailedException[[myentity][0] No active shard(s)]"
      }
    ]
  }
}
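
As an aside, for anyone scripting around this, a failure response of that shape is easy to check programmatically before deciding whether to retry. A minimal sketch in Python (the dict literal just mirrors the response above; `failed_shards` is a hypothetical helper, not part of any ElasticSearch client):

```python
# Mirrors the _shards section of the failure response above.
response = {
    "_shards": {
        "total": 5,
        "successful": 2,
        "failed": 3,
        "failures": [
            {"index": "myentity", "shard": 4,
             "reason": "BroadcastShardOperationFailedException[[myentity][4] No active shard(s)]"},
            {"index": "myentity", "shard": 3,
             "reason": "BroadcastShardOperationFailedException[[myentity][3] No active shard(s)]"},
            {"index": "myentity", "shard": 0,
             "reason": "BroadcastShardOperationFailedException[[myentity][0] No active shard(s)]"},
        ],
    }
}

def failed_shards(resp):
    """Return the ids of the shards that reported a failure, sorted."""
    return sorted(f["shard"] for f in resp["_shards"].get("failures", []))

print(failed_shards(response))  # [0, 3, 4]
```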

After restarting ElasticSearch and Hadoop (not sure if this is what triggered
it), my other indices suffer from that BroadcastShardOperationFailedException
as well (the shard that gets the exception is random, and it is not limited
to the index that failed during indexing).

Any ideas what's happening here?

Thanks,

--
Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

On Sat, Jul 17, 2010 at 7:01 PM, Franz Allan Valencia See <
franz.see@gmail.com> wrote:

Things like:

org.elasticsearch.transport.RemoteTransportException:
[Icemaster][inet[/10.0.6.1:9300]][indices/index/shard/index]
Caused by: org.elasticsearch.action.PrimaryNotStartedActionException:
[application][1] Timeout waiting for [1m]
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$4.onTimeout(TransportShardReplicationOperationAction.java:311)
at
org.elasticsearch.cluster.service.InternalClusterService$1$1.run(InternalClusterService.java:87)

at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

Caused by: org.elasticsearch.action.PrimaryNotStartedActionException:
[application][1] Timeout waiting for [1m]
at
org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$4.onTimeout(TransportShardReplicationOperationAction.java:311)
at
org.elasticsearch.cluster.service.InternalClusterService$1$1.run(InternalClusterService.java:87)

at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

Or some shard broadcast exception, or something like "cannot update an
index because something is already updating it".

I'll post more stacktraces once I get them.

Thanks,

--
Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see
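
(Editor's note: transient timeouts like the PrimaryNotStartedActionException above are often worked around client-side with a bounded retry. A generic sketch, assuming a retryable indexing call; `index_one` and the fake `flaky` indexer are hypothetical stand-ins, not part of the poster's code or any ElasticSearch API:)

```python
import time

def index_with_retry(index_one, doc, max_attempts=3, backoff_seconds=1.0):
    """Call index_one(doc), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return index_one(doc)
        except IOError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            time.sleep(backoff_seconds * (2 ** (attempt - 1)))

# Example with a fake indexer that fails twice, then succeeds:
calls = {"n": 0}
def flaky(doc):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated transient failure")
    return "ok:" + doc

print(index_with_retry(flaky, "doc-1", backoff_seconds=0.0))  # ok:doc-1
```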

On Sat, Jul 17, 2010 at 4:21 PM, Shay Banon <shay.banon@elasticsearch.com

wrote:

In general you should not get exception while indexing (unless you index
something wrong). What type of exceptions do you get? Remind me again on
which version you are?

-shay.banon

On Sat, Jul 17, 2010 at 4:03 AM, Franz Allan Valencia See <
franz.see@gmail.com> wrote:

I'm no longer getting that 'Too many open files' error, but the problem is
still there. Basically, if something goes wrong while I'm indexing on a
particular index, all my indices get corrupted.

Sometimes, to fix it, I even have to reformat my hdfs node.



(Franz Allan Valencia See) #9

Thanks, I'll try this out again on 0.9 once it's out :)

For those who are experiencing the same thing, what I did was to back up the
$ES_HOME/work/elasticsearch/index and hdfs://path/to/index directories every
now and then. Once the corruption occurs, I delete the existing
$ES_HOME/work/elasticsearch/index and hdfs://path/to/index and restore my
backup (note: you need to reformat the namenode before restoring the backup to HDFS).
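
(Editor's note: the local half of that workaround amounts to a copy-and-swap of the index directory. A rough sketch, with placeholder paths; the HDFS side would additionally need `hadoop fs` copies plus the namenode reformat mentioned above:)

```python
import os
import shutil

def backup_index(index_dir, backup_dir):
    """Snapshot the local index directory, replacing any previous backup."""
    if os.path.exists(backup_dir):
        shutil.rmtree(backup_dir)
    shutil.copytree(index_dir, backup_dir)

def restore_index(index_dir, backup_dir):
    """Drop the (corrupted) index directory and restore the last good backup."""
    if os.path.exists(index_dir):
        shutil.rmtree(index_dir)
    shutil.copytree(backup_dir, index_dir)
```

Run `backup_index` only after a known-good replication; restoring over a half-written index is exactly the swap described above.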

Cheers,

--
Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

On Tue, Jul 20, 2010 at 4:51 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:

Not sure about the Hadoop exception, to be honest... Regarding the other
exceptions from ElasticSearch, there is some chance you would get them in
0.8; they have been addressed in the upcoming 0.9.

(system) #10