Intermittent shard failures with "has_child" type queries

I'm having a problem with intermittent shard failures (doesn't always
happen) when executing "has_child" type queries against parent/child
documents.

This comes up when I am running a set of functional tests against my
application. The tests create some documents, index them, update the
documents, re-index them, search on them, delete documents, etc.

Most of the time the tests run successfully, but occasionally the
search fails. When it fails, the search doesn't throw an exception,
but it doesn't find the document, and the response contains shard
failures.

However, once it gets into the state where the search returns shard
failures, the search fails consistently. It is only intermittent
in the sense that most of the time the tests succeed. Here are the
shard failures:

[shard [[8ZUq44E3QUu8XPbPvb9yfA][acme][2]], reason
[RemoteTransportException[[Ardina][inet[/10.10.30.52:9300]][search/phase/query]];
nested: QueryPhaseExecutionException[[acme][2]:
query[title:someTitle ConstantScore(child_filter[contentFiles/content]
(filtered(file:test)->FilterCacheFilterWrapper(_type:contentFiles)))],from[0],size[10]:
Query Failed [Failed to execute child query
[filtered(file:test)->FilterCacheFilterWrapper(_type:contentFiles)]]]; nested: ]]
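
Since the search call itself does not throw, a functional test has to look inside the response to notice the problem. Below is a minimal Python sketch of such a check. It assumes the search response JSON exposes a `_shards` object with a `failed` count and a `failures` list; the sample response is hand-written for illustration rather than captured from a cluster, and `shard_failures` is a hypothetical helper name.

```python
import json


def shard_failures(response):
    """Return the shard failure entries found in a search response dict.

    Assumes the response has a "_shards" object with a "failed" count and
    an optional "failures" list. Returns [] when no shard failed.
    """
    shards = response.get("_shards", {})
    if shards.get("failed", 0) == 0:
        return []
    return shards.get("failures", [])


# Hand-written sample shaped like a partially failed search response.
SAMPLE = json.loads("""
{
  "_shards": {
    "total": 5,
    "successful": 4,
    "failed": 1,
    "failures": [
      {"index": "acme", "shard": 2,
       "reason": "QueryPhaseExecutionException[[acme][2]: Query Failed]"}
    ]
  },
  "hits": {"total": 0, "hits": []}
}
""")

if __name__ == "__main__":
    for failure in shard_failures(SAMPLE):
        print(failure["index"], failure["shard"], failure["reason"])
```

A test could assert that this list is empty after every search, so that a partial failure is reported as a test failure instead of as a missing document.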

Any idea what could be causing this?

Is there anything to be careful of when re-indexing the parent and
child documents separately? Routing?

Thanks,
Lar

More info:
I checked the elasticsearch log and found the following:

[2011-04-11 16:51:09,892][DEBUG][action.search.type ] [Ardina] [acme][2], node[8ZUq44E3QUu8XPbPvb9yfA], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4dae0a]
org.elasticsearch.search.query.QueryPhaseExecutionException: [acme][2]: query[ConstantScore(child_filter[contentFiles/content](filtered(file:mission file:statement)->FilterCacheFilterWrapper(_type:contentFiles)))],from[0],size[10]: Query Failed [Failed to execute child query [filtered(file:mission file:statement)->FilterCacheFilterWrapper(_type:contentFiles)]]
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:169)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:132)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:76)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:192)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.access$000(TransportSearchTypeAction.java:75)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.run(TransportSearchTypeAction.java:151)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
    at org.elasticsearch.index.query.type.child.ChildCollector.collect(ChildCollector.java:75)
    at org.apache.lucene.search.Scorer.score(Scorer.java:62)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:212)
    at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:159)
    at org.apache.lucene.search.Searcher.search(Searcher.java:67)
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:143)
    ... 9 more

Heya,

I can see where it happens, but I'm not sure why. Can you provide a recreation (even if it only fails intermittently) that I can run?

-shay.banon

Thanks for the speedy reply!

What would you need to run this? I can send you the elasticsearch
data files when it is in this state, along with the query (and any
other supporting files). Let me know what you need and where to send
it.

Thanks so much!
Lar

The simplest way would be something like a curl recreation that first indexes some data and then runs the query (possibly multiple times) to trigger the error. This will allow me to convert it into an integration test and fix it, and also make sure it will never happen again :).

If you can try and work on the above, it would be very helpful. If it's a no-go, then ping me privately on IRC so I can download the data files and try to recreate it that way.

-shay.banon

Ok, I think I understand what is happening now.

The error condition that I was seeing is reproducible with the
following sequence of events:

  1. Index a parent document
  2. Index a child document of the parent from step 1.
  3. Delete the parent document
  4. Execute a has_child search. This will return a shard failure and
    the log file shows the null pointer exception.
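
The four steps above can be replayed against a toy in-memory model. This is only a sketch of the situation as described in this message, not of how Elasticsearch resolves parents internally: `ToyIndex`, `OrphanedChildError`, and the document ids are all hypothetical, and the deliberately naive `has_child` stands in for a collector that assumes every child's parent still exists.

```python
class OrphanedChildError(Exception):
    """Stands in for the NullPointerException seen in the server log."""


class ToyIndex:
    """In-memory stand-in for one shard holding parent and child docs."""

    def __init__(self):
        self.parents = {}    # parent id -> parent document
        self.children = {}   # child id -> (parent id, child document)

    def index_parent(self, parent_id, doc):
        self.parents[parent_id] = doc

    def index_child(self, child_id, parent_id, doc):
        self.children[child_id] = (parent_id, doc)

    def delete_parent(self, parent_id):
        self.parents.pop(parent_id, None)

    def has_child(self, predicate):
        """Return ids of parents that have a child matching predicate.

        Deliberately naive: a matching child whose parent is gone is
        treated as a hard error instead of being skipped.
        """
        matched = set()
        for parent_id, doc in self.children.values():
            if predicate(doc):
                if parent_id not in self.parents:
                    raise OrphanedChildError(parent_id)
                matched.add(parent_id)
        return sorted(matched)


if __name__ == "__main__":
    index = ToyIndex()
    index.index_parent("p1", {"title": "someTitle"})    # step 1
    index.index_child("c1", "p1", {"file": "test"})     # step 2
    index.delete_parent("p1")                           # step 3
    try:
        index.has_child(lambda doc: doc["file"] == "test")  # step 4
    except OrphanedChildError as err:
        print("search failed, orphaned child of parent", err)
```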

Well, it makes sense that this would represent an error condition.
The situation comes up somehow with our functional tests that are
doing a lot of indexing, re-indexing, and deleting, and it seems that
somehow the deletes are getting interleaved with the updates.

I think what is happening in our app test is that a delete operation
is underway that is deleting the child and then the parent, but an
indexing operation snuck in between and indexed the parent and the
child, like this:

  1. On thread 1 - Delete Child document executes, but before the Delete
    Parent starts...
  2. On another thread the same child gets indexed. Now the child doc
    is back.
  3. Now, back on thread 1 - the Delete Parent operation executes. Now
    the child is orphaned, and the has_child search will fail.

I think we can make changes in our app to avoid this. If we delete
the parent first, then the above scenario wouldn't result in the shard
failure; it would just leave a parent with no child, which isn't so bad.
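
To make the effect of the delete order concrete, here is a small Python replay of the interleaving described above. It is a sketch under simplifying assumptions: the store is just a set that may hold "parent" and "child", the concurrent indexing operation re-indexes both documents between the two deletes, and `replay` and `interleave` are hypothetical helper names.

```python
def replay(operations):
    """Apply (action, kind) steps to an empty store and return what is left.

    The store is a set that may contain "parent" and/or "child".
    """
    store = set()
    for action, kind in operations:
        if action == "index":
            store.add(kind)
        elif action == "delete":
            store.discard(kind)
        else:
            raise ValueError("unknown action: %r" % (action,))
    return store


def interleave(delete_order):
    """Run both deletes in the given order with a re-index in between."""
    first, second = delete_order
    return replay([
        ("index", "parent"), ("index", "child"),   # initial state
        ("delete", first),                         # thread 1, first delete
        ("index", "parent"), ("index", "child"),   # other thread sneaks in
        ("delete", second),                        # thread 1, second delete
    ])


if __name__ == "__main__":
    # Child first: only the child survives, so it is orphaned.
    print("child first: ", interleave(("child", "parent")))
    # Parent first: only the parent survives, with no child.
    print("parent first:", interleave(("parent", "child")))
```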

Are there other strategies to avoid this? Is there a way to execute a
delete against both the child and the parent in an atomic fashion, or
perhaps a way to have the deletion of the parent automatically delete
the child?
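
One application-side option, independent of anything Elasticsearch itself offers, is to serialize all operations that touch the same parent id so that a delete of the pair can never interleave with an index of the pair. The Python sketch below illustrates the idea with an in-memory store and one lock per parent id. `FamilyStore` and `hammer` are hypothetical names, and the approach only helps when every writer goes through the same process.

```python
import threading
from collections import defaultdict


class FamilyStore:
    """In-memory store that serializes operations on the same parent id."""

    def __init__(self):
        self._docs = set()                          # ("parent" | "child", parent id)
        self._locks = defaultdict(threading.Lock)   # one lock per parent id
        self._locks_guard = threading.Lock()        # protects the lock table

    def _lock_for(self, parent_id):
        with self._locks_guard:
            return self._locks[parent_id]

    def index_family(self, parent_id):
        """Index the parent and its child as one uninterrupted unit."""
        with self._lock_for(parent_id):
            self._docs.add(("parent", parent_id))
            self._docs.add(("child", parent_id))

    def delete_family(self, parent_id):
        """Delete the child and its parent as one uninterrupted unit."""
        with self._lock_for(parent_id):
            self._docs.discard(("child", parent_id))
            self._docs.discard(("parent", parent_id))

    def orphans(self):
        """Parent ids with a child but no parent. Call once writers are done."""
        return sorted(pid for kind, pid in self._docs
                      if kind == "child" and ("parent", pid) not in self._docs)


def hammer(store, parent_id, rounds=1000):
    """Repeatedly index and delete the same parent/child pair."""
    for _ in range(rounds):
        store.index_family(parent_id)
        store.delete_family(parent_id)


if __name__ == "__main__":
    store = FamilyStore()
    threads = [threading.Thread(target=hammer, args=(store, "p1"))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("orphans:", store.orphans())
```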

Kind regards,
Lar

Can you recreate the first scenario? It should not fail in this case; it should just ignore those docs with no parents.
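
As an illustration of the "just ignore those docs with no parents" behaviour, here is a toy Python sketch. It is not Elasticsearch's ChildCollector code; `parents_with_matching_child` is a hypothetical function over plain dicts that skips any child whose parent id cannot be found instead of failing on it.

```python
def parents_with_matching_child(parents, children, predicate):
    """Return sorted ids of existing parents that have a matching child.

    parents:  dict of parent id -> parent document
    children: iterable of (parent id, child document) pairs
    Children whose parent id is missing from `parents` are skipped.
    """
    matched = set()
    for parent_id, child_doc in children:
        if parent_id not in parents:
            continue  # orphaned child: ignore it instead of failing
        if predicate(child_doc):
            matched.add(parent_id)
    return sorted(matched)


if __name__ == "__main__":
    parents = {"p2": {"title": "someTitle"}}
    children = [
        ("p1", {"file": "test"}),   # parent p1 has been deleted
        ("p2", {"file": "test"}),
    ]
    print(parents_with_matching_child(
        parents, children, lambda doc: doc["file"] == "test"))
```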

Ok, I've created a fairly simple java program that reliably reproduces
the problem.

The program is available on gist at: https://gist.github.com/920209

The program needs to load a pdf document. Probably any pdf would
suffice but I've uploaded the one I used for testing this here:

---- begin really long url -----

http://ym6oxg.sn2.livefilestore.com/y1px5Kd8OyeW32v4CjbjM-db6pYUXb4Llr496sP_0Q9wpXTICejS9Huq0Y8byQRn8B_tnruYPXneq_gvDPuNDMA7-0P4KW9wGC6/document1.pdf?download&psid=1
---- end really long url ----

The program essentially just loops on:

  1. Creating a parent and child doc, where the child is the pdf
  2. Executing a search
  3. Deleting the parent and child docs
  4. Repeat
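
The shape of such a loop can be sketched in Python as follows. This is not the program from the gist: `run_until_shard_failure` and `FlakyBackend` are hypothetical names, the backend is a stand-in that simply reports a shard failure on a chosen search, and the check assumes the search response carries a `_shards.failed` count.

```python
def run_until_shard_failure(create, search, delete, max_iterations=1000):
    """Loop create -> search -> delete and return the first failing iteration.

    create(i) and delete(i) receive the iteration number; search() returns
    a response dict. Returns None if no shard failure is ever reported.
    """
    for i in range(1, max_iterations + 1):
        create(i)
        response = search()
        delete(i)
        if response.get("_shards", {}).get("failed", 0) > 0:
            return i
    return None


class FlakyBackend:
    """Hypothetical stand-in that reports a shard failure on the Nth search."""

    def __init__(self, fail_on):
        self.fail_on = fail_on
        self.searches = 0
        self.live = set()    # ids of parent/child pairs currently indexed

    def create(self, i):
        self.live.add(i)

    def delete(self, i):
        self.live.discard(i)

    def search(self):
        self.searches += 1
        failed = 1 if self.searches == self.fail_on else 0
        return {"_shards": {"total": 5, "successful": 5 - failed,
                            "failed": failed}}


if __name__ == "__main__":
    backend = FlakyBackend(fail_on=3)
    print(run_until_shard_failure(backend.create, backend.search,
                                  backend.delete))
```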

After looping for a short while a shard failure occurs, and a null
pointer exception is logged in elasticsearch.

Please take a look, as this looks to be a legitimate issue.

Thanks so much,
Lar

Thanks for the effort. I'm not sure why the test case needs to index a pdf. Can you simplify it so it just indexes some text and that's it? It will be simpler for me to turn it into a test case in ES.

Ok, good call. Indexing the pdf wasn't necessary, although it did
make the problem occur a little more quickly.

Here's a version that doesn't need a pdf. You may need to run it more
than once in a row for the failure to occur, since it is intermittent:

Thanks again!
Lar

Also, I just edited it to clean up a few minor details. Should be
good to go now.

Lar

Shay,

Have you been able to duplicate the problem at your end with the
program I posted on gist? This seems like a concurrency bug in
elastic, and has us concerned.

Thanks,
Lar

Not yet, will look at it this week.

Heya,

Recreated and pushed a fix: issue #875, "Search request intermittent failures with has_child query/filter" (elastic/elasticsearch on GitHub).

-shay.banon

Awesome, thanks!

Lar