Two different shard exceptions


(ppearcy) #1

Hi Shay,
Experienced some weird behavior over the weekend on 0.10 that I
haven't seen before. Running a 2 node mirrored cluster.

  1. Searching a certain shard on a certain node fails. Here is the
    exception I was getting:

    RemoteTransportException[[DM-ADSEARCHD102.dev.local][inet[/10.2.20.160:9301]][search/phase/query/id]]; nested:
    QueryPhaseExecutionException[[newsmedia_20100917150044][0]:
    query[filtered(+(+feedid:753 +wsodissue:44874) +__documentdate:[* TO 1285023084000])->FilterCacheFilterWrapper(QueryWrapperFilter((+indexid:genericnews2 +(feedid:753 feedid:1236))
    (+indexid:newsmedia +providersubgroup:ap)))],from[0],size[500],sort[<custom:"__documentdate":
    org.elasticsearch.index.field.data.FieldData$Type$4$1@63ab3977>!,<custom:"documentkey":
    org.elasticsearch.index.field.data.FieldData$Type$1$1@7e49e6bf>]:
    Query Failed [Failed to execute main query]]; nested:

The search is valid and worked every other time (the requests that
succeeded went to the good node). To confirm, I shut down the good node
and it failed every time. I then brought the good node back up,
shut down the bad one, and it worked every time. After bringing the
bad node back up, the query was still failing. I was able to
resolve this by clearing the work directory on the bad node.

  2. Snapshot error. I have the snapshot interval disabled and am
    snapshotting based on content received. I started receiving this
    exception:

    ERROR > Shapshot failed, index: djnf_20100917150037, shard: 0, reason:
    BroadcastShardOperationFailedException[[djnf_20100917150037][0] ]; nested:
    RemoteTransportException[[dm-adsearchd103.dev.local][inet[/10.2.20.164:9300]][indices/gateway/snapshot/shard]]; nested:
    IndexShardGatewaySnapshotFailedException[[djnf_20100917150037][0] duplicate key: __2tf]; nested:
    IllegalArgumentException[duplicate key: __2tf]; (Timer-0)

This was resolved by restarting the cluster.

I have seen each of these issues only once and will keep an eye out for
them again, but wanted to give a heads up, as they seem like potential
bugs.

Thanks,
Paul


(Shay Banon) #2

Hi Paul,

Both are strange. Are there by any chance more detailed exceptions in the
logs?

-shay.banon


(ppearcy) #3

Hey Shay,
Scoured the logs and, unfortunately, that is all I have. If I see
either of these again, will enable more detailed logging and see what
I capture.

Thanks,
Paul


(ppearcy) #4

Btw, I was unable to reproduce the search exception via curl. Does the
REST interface have internal retries? I am using the Java Node
client. Are there any retries available via that interface?

Thanks,
Paul


(Shay Banon) #5

The REST interface uses the Java Client to do the operations, so I don't
think it's related. I will go over the exceptions and make sure that they
are at least logged properly.
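
The thread does not say that either interface retries internally. If a caller wanted its own retry around a search call, a minimal sketch could look like the following. `Retry` and `withRetries` are hypothetical names for illustration and are not part of any elasticsearch API. Note that a retry like this would only hide a bad shard copy such as the one described above, not repair it.

```java
import java.util.concurrent.Callable;

// Hypothetical caller-side helper; not part of any elasticsearch API.
final class Retry {
    private Retry() {}

    /**
     * Runs the task up to maxAttempts times, pausing between attempts,
     * and rethrows the last failure if every attempt fails.
     */
    static <T> T withRetries(int maxAttempts, long pauseMillis, Callable<T> task)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                // Only pause if another attempt is still to come.
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }
}
```

The search call itself would go inside the `Callable`, so the helper stays independent of whichever client is used.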


(ppearcy) #6

Hey Shay,
Hitting the snapshot failed exception, at the moment.

I tried increasing the log level, but it doesn't appear the
logging.yml file dynamically updates the log level.

Will probably start restarting nodes and playing around in a few
minutes. Let me know what I should have in place to get the necessary
information to track this down next time around. I'm on 0.10.0 and not
against moving to master, if that would help.

Thanks,
Paul


(ppearcy) #7

FYI, bumped up gateway logging (required a node restart, which cleared
the issue), so hopefully will have more data next time around. Also,
when I shut the node down, I got a stack trace that may be of more
use.

Thanks,
Paul


(Shay Banon) #8

Hi Paul,

Yea, that exception helps a lot, though it is very, very, very strange. This
is where it's coming from:

    File[] files = path.listFiles();
    if (files == null || files.length == 0) {
        return ImmutableMap.of();
    }
    ImmutableMap.Builder<String, BlobMetaData> builder = ImmutableMap.builder();
    for (File file : files) {
        builder.put(file.getName(), new PlainBlobMetaData(file.getName(), file.length()));
    }
    return builder.build();

Basically, as you can see, I list the files in a directory and then build
an immutable map from them. The strange thing is that it complains that
listFiles returned a duplicate File. I will fix this, but how bizarre!
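
The committed fix is not shown in this thread. As a rough sketch of one way to tolerate a listing that repeats a name, the version below uses plain `java.util` collections instead of Guava's `ImmutableMap.Builder` (which rejects duplicate keys) and keeps the first entry for each name. `BlobListing` and `listBlobs` are hypothetical names, and the value is simplified to the file length rather than a `PlainBlobMetaData`.

```java
import java.io.File;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the actual elasticsearch fix.
final class BlobListing {
    private BlobListing() {}

    /**
     * Maps file name to file length. If the listing repeats a name,
     * the first entry wins instead of an exception being thrown.
     */
    static Map<String, Long> listBlobs(File[] files) {
        if (files == null || files.length == 0) {
            return Collections.emptyMap();
        }
        Map<String, Long> blobs = new LinkedHashMap<String, Long>();
        for (File file : files) {
            if (!blobs.containsKey(file.getName())) {
                blobs.put(file.getName(), file.length());
            }
        }
        // Read-only view, so callers cannot modify the listing.
        return Collections.unmodifiableMap(blobs);
    }
}
```

A caller would pass the result of `path.listFiles()` directly, since the method already handles a `null` or empty array.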

-shay.banon

On Thu, Sep 23, 2010 at 7:06 PM, Paul ppearcy@gmail.com wrote:

FYI, bumped up gateway logging (required a node restart, which cleared
the issue), so hopefully will have more data next time around. Also,
when I shut the node down, I got a stack trace that may be of more
use.

http://gist.github.com/593991

Thanks,
Paul


(ppearcy) #9

Awesome, thanks! I see an update already committed.

Wow, that is weird... Did some googling around and couldn't find any
details on a bug similar to this.

Probably beside the point, but here are some details on my setup:

  • Using NFS based gateway, exported such as:
    /share/adsearch dm-adsearchd103(rw,async,no_root_squash)

  • Using this version of CentOS (not my choice):
    Tikanga
    CentOS release 5.5 (Final)

  • Running this version of java:
    java version "1.6.0"
    OpenJDK Runtime Environment (build 1.6.0-b09)
    OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)

I'm a little suspicious of our NFS setup, as it is carved out of one of the
nodes, but this setup is only temporary.

Will plan on moving to master tonight and keep an eye out.

Thanks!
Paul


(Shay Banon) #10

Hey,

A few things:

  1. I suggest you use sync mode and not async with NFS. As writing is done in
    the background, it does not have any performance implications. In any case I
    fsync all the files; I'm not sure whether that overrides the async mode of
    NFS or not.

  2. The Java version is pretty old. OpenJDK lags behind the Sun JDK when it
    comes to new versions (I think in Ubuntu it's at b18, and a major memory
    leak in LinkedBlockingQueue was fixed in b19).
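
For reference, this is roughly what forcing a written file to disk looks like in Java. It is only a sketch: `SyncedWrite` and `writeAndSync` are hypothetical names, not the gateway code, and whether the sync reaches stable storage on an NFS export mounted with async is exactly the open question in point 1.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch; not the actual gateway implementation.
final class SyncedWrite {
    private SyncedWrite() {}

    /** Writes the bytes to the target file and asks the OS to flush them to the device. */
    static void writeAndSync(File target, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(target)) {
            out.write(data);
            out.flush();
            // FileDescriptor.sync() returns once the OS reports the buffers as written;
            // it throws SyncFailedException if that cannot be guaranteed.
            out.getFD().sync();
        }
    }
}
```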

-shay.banon


(Shay Banon) #11

Also, wondering out loud here, but if you move to master, you might consider
using the local gateway support (the default now) and not using NFS at all.

-shay.banon


(ppearcy) #12

Hey Shay,
Thanks for all the help on this thread and in IRC. I had cleared the
issue I saw with index21 yesterday by recovering from the gateway.

However, I wanted to mention that I just moved up to the 0.11 snapshot
and after startup, index21 was hosed on all nodes as well as the
gateway. This was the first time I had seen it affect both nodes and
the gateway.

Thanks,
Paul


(Shay Banon) #13

Do you mean it got removed from the gateway?


(ppearcy) #14

Previously, the issue only affected one server (probably the non-master
for the shard, which is why it didn't go to the gateway).

This time around, whatever went bad in the index got persisted to the
gateway, causing all queries against that index to fail.

Thanks,
Paul


(Shay Banon) #15

The fact that it even got solved when deleting the work dir and recovering
from the gateway is strange. Is there a chance that you can change that NFS
mount from async to sync?

On Fri, Sep 24, 2010 at 10:39 PM, Paul ppearcy@gmail.com wrote:

Previously, the issue only effected one server (probably the non-
master for the shard, which is why it didn't go to the gateway).

This time around, whatever went bad in the index got persisted to the
gateway, causing all queries against that index to fail.

Thanks,
Paul

On Sep 24, 1:02 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Do you mean it got removed from the gateway?

On Fri, Sep 24, 2010 at 7:41 PM, Paul ppea...@gmail.com wrote:

Hey Shay,
Thanks for all the help on this thread and in IRC. I had cleared the
issue I saw with index21 yesterday, by recovering from the gateway.

However, I wanted to mention that I just moved up to the 0.11 snapshot
and after startup, index21 was hosed on all nodes as well as the
gateway. This was the first time I had seen it affect both nodes and
the gateway.

Thanks,
Paul

On Sep 23, 11:50 am, Shay Banon shay.ba...@elasticsearch.com wrote:

Also, wondering out loud here, but if you move to master, you might
consider using the local gateway support (the default now) and not use
NFS at all.

-shay.banon

On Thu, Sep 23, 2010 at 7:49 PM, Shay Banon <shay.ba...@elasticsearch.com> wrote:

Hey,

A few things:

  1. I suggest you use sync mode and not async with NFS. As writing is
    done in the background, it does not have any performance implications.
    In any case I fsync all the files, not sure if it overrides the async
    mode of NFS or not...

  2. The java version is pretty old. openjdk lags behind the sun jdk when
    it comes to new versions (I think in ubuntu it's at b18, where a major
    memory leak in LinkedBlockingQueue was fixed in b19).

-shay.banon
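
For readers unfamiliar with the fsync step mentioned above, here is a minimal sketch of writing a file and forcing it to storage with the standard java.nio API (Java 7+). The class and method names (`DurableWrite`, `writeAndSync`) are made up for illustration and this is not the Elasticsearch gateway code. Whether `force` is honored by an NFS export in async mode is exactly the open question in the message above, and the sketch makes no claim about it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not taken from Elasticsearch.
final class DurableWrite {

    // Writes the bytes to the target file, then asks the OS to flush both
    // content and metadata to the storage device before returning.
    static void writeAndSync(Path target, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // force(true) is the Java counterpart of fsync: content plus metadata.
            channel.force(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("snapshot-", ".blob");
        writeAndSync(tmp, "segment data".getBytes(StandardCharsets.UTF_8));
        System.out.println(Files.size(tmp));
        Files.delete(tmp);
    }
}
```

On a local file system this blocks until the data has been handed to the device. On a remote mount the guarantee depends on the server's export options, which is why switching the export to sync is suggested independently of the fsync call.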

On Thu, Sep 23, 2010 at 7:41 PM, Paul ppea...@gmail.com wrote:

Awesome, thanks! I see an update already committed.

Wow, that is weird... Did some googling around and couldn't find any
details on a bug similar to this.

Probably beside the point, but here are some details on my setup:

  • Using NFS based gateway, exported such as:
    /share/adsearch
    dm-adsearchd103(rw,async,no_root_squash)
  • Using this version of CentOS (not my choice):
    Tikanga
    CentOS release 5.5 (Final)
  • Running this version of java:
    java version "1.6.0"
    OpenJDK Runtime Environment (build 1.6.0-b09)
    OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)

I'm a little suspicious of our NFS setup, as it is carved from one of the
nodes, but this setup is only temporary.

Will plan on moving to master tonight and keep an eye out.

Thanks!
Paul

On Sep 23, 11:11 am, Shay Banon shay.ba...@elasticsearch.com wrote:

Hi Paul,

Yea, that exception helps a lot, though very very very strange...
This is where it's coming from:

    File[] files = path.listFiles();
    if (files == null || files.length == 0) {
        return ImmutableMap.of();
    }
    ImmutableMap.Builder<String, BlobMetaData> builder = ImmutableMap.builder();
    for (File file : files) {
        builder.put(file.getName(), new PlainBlobMetaData(file.getName(), file.length()));
    }
    return builder.build();

Basically, as you can see, I list the files in a directory, and then build
an immutable map from them. The strange thing is that it complains that
basically the listFiles returned duplicate File... I will fix this, but
how bizarre!

-shay.banon
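
As a rough illustration of the failure described above: Guava's `ImmutableMap.Builder` rejects repeated keys with an `IllegalArgumentException`, which appears consistent with the "duplicate key: __2tf" message in the snapshot error. The sketch below shows one way a listing could be made tolerant of repeated entries using only `java.util`. It is an assumption about what a fix might look like, not the change that was actually committed, and `BlobListing` and `toSizeMap` are hypothetical names.

```java
import java.io.File;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the Elasticsearch source: builds a name -> length
// map from a directory listing while tolerating repeated entries.
final class BlobListing {

    static Map<String, Long> listBlobs(File dir) {
        return toSizeMap(dir.listFiles());
    }

    static Map<String, Long> toSizeMap(File[] files) {
        if (files == null || files.length == 0) {
            return Collections.emptyMap();
        }
        Map<String, Long> sizes = new LinkedHashMap<>();
        for (File file : files) {
            // put() overwrites an existing key instead of throwing, so a
            // listing that reports the same name twice collapses to one entry.
            sizes.put(file.getName(), file.length());
        }
        return Collections.unmodifiableMap(sizes);
    }

    public static void main(String[] args) {
        // Simulates a listing that returns the same file name twice.
        File[] listing = { new File("__2tf"), new File("__2tf"), new File("__2tg") };
        System.out.println(toSizeMap(listing).keySet());
    }
}
```

The trade-off is that a duplicate is silently absorbed rather than reported, so logging the repeated name would probably still be worthwhile when chasing the underlying NFS behavior.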

On Thu, Sep 23, 2010 at 7:06 PM, Paul ppea...@gmail.com wrote:

FYI, bumped up gateway logging (required a node restart, which cleared
the issue), so hopefully will have more data next time around. Also,
when I shut the node down, I got a stack trace that may be of more use.

http://gist.github.com/593991

Thanks,
Paul

On Sep 23, 10:36 am, Paul ppea...@gmail.com wrote:

Hey Shay,
Hitting the snapshot failed exception, at the moment.

I tried increasing the log level, but it doesn't appear the
logging.yml file dynamically updates the log level.

Will probably start restarting nodes and playing around in a few
minutes. Let me know what I should have in place to get the necessary
information to track this down next time around. I'm on 0.10.0 and not
against moving to master, if that would help.

Thanks,
Paul

On Sep 21, 2:33 am, Shay Banon <shay.ba...@elasticsearch.com> wrote:

The REST interface uses the Java Client to do the operations, so I don't
think it's related. I will go over the exceptions and see that at least
they are properly logged.

On Tue, Sep 21, 2010 at 6:59 AM, Paul ppea...@gmail.com wrote:

Btw, I was unable to reproduce the search exception via curl. Does the
REST interface have internal retries? I am using the Java Node
client. Are there any retries available via that interface?

Thanks,
Paul

On Sep 20, 6:09 pm, Paul ppea...@gmail.com wrote:

Hey Shay,
Scoured the logs and, unfortunately, that is all I have. If I see
either of these again, will enable more detailed logging and see what
I capture.

Thanks,
Paul

On Sep 20, 5:48 pm, Shay Banon <shay.ba...@elasticsearch.com> wrote:

Hi Paul,

Both are strange. Are there by any chance more detailed exceptions in
the logs?

-shay.banon


(ppearcy) #16

Yep, will do.


(system) #17