Intermittent shard failures with "has_child" type queries

Lar_Mader · April 11, 2011, 11:23pm

I'm having a problem with intermittent shard failures (doesn't always
happen) when executing "has_child" type queries against parent/child
documents.

This comes up when I am running a set of functional tests against my
application. The tests create some documents, index them, update the
documents, re-index them, search on them, delete documents, etc.

Most of the time the tests run successfully, but occasionally the
search fails. When it fails, the search doesn't throw an exception,
but it doesn't find the document, and the response contains shard
failures.

However, once it gets into the state where the search returns shard
failures, the search fails consistently. It is only intermittent
in the sense that most of the time the tests succeed. Here are the
shard failures:

[shard [[8ZUq44E3QUu8XPbPvb9yfA][acme][2]], reason
[RemoteTransportException[[Ardina][inet[/10.10.30.52:9300]][search/phase/query]]; nested: QueryPhaseExecutionException[[acme][2]:
query[title:someTitle ConstantScore(child_filter[contentFiles/content](filtered(file:test)->FilterCacheFilterWrapper(_type:contentFiles)))],from[0],size[10]:
Query Failed [Failed to execute child query [filtered(file:test)->FilterCacheFilterWrapper(_type:contentFiles)]]]; nested: ]]

Any idea what could be causing this?

Is there anything to be careful of when re-indexing the parent and
child documents separately? Routing?

Thanks,
Lar

Lar_Mader · April 11, 2011, 11:54pm

More info:
I checked the elasticsearch log and see the following:

[2011-04-11 16:51:09,892][DEBUG][action.search.type ] [Ardina] [acme][2], node[8ZUq44E3QUu8XPbPvb9yfA], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4dae0a]
org.elasticsearch.search.query.QueryPhaseExecutionException: [acme][2]: query[ConstantScore(child_filter[contentFiles/content](filtered(file:mission file:statement)->FilterCacheFilterWrapper(_type:contentFiles)))],from[0],size[10]: Query Failed [Failed to execute child query [filtered(file:mission file:statement)->FilterCacheFilterWrapper(_type:contentFiles)]]
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:169)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:132)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:76)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:192)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.access$000(TransportSearchTypeAction.java:75)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.run(TransportSearchTypeAction.java:151)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
    at org.elasticsearch.index.query.type.child.ChildCollector.collect(ChildCollector.java:75)
    at org.apache.lucene.search.Scorer.score(Scorer.java:62)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:212)
    at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:159)
    at org.apache.lucene.search.Searcher.search(Searcher.java:67)
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:143)
    ... 9 more

kimchy · April 12, 2011, 9:25am

Heya,

I can see where it happens, but I'm not sure why. Can you provide a recreation (even if it only fails intermittently) that I can run?

-shay.banon

Lar_Mader · April 12, 2011, 6:17pm

Thanks for the speedy reply!

What would you need to run this? I can send you the elasticsearch
data files when it is in this state, along with the query (and any
other supporting files). Let me know what you need and where to send
it.

Thanks so much!
Lar

kimchy · April 12, 2011, 9:34pm

The simplest way would be something like a curl recreation that first indexes some data and then runs the query (possibly multiple times) to trigger the error. This will allow me to convert it into an integration test and fix it, and also make sure it will never happen again :).

If you can try and work on the above, it would be very helpful. If it's a no-go, then ping me privately on IRC, where I can download the data files and try to recreate it that way.

-shay.banon

Lar_Mader · April 13, 2011, 2:38am

Ok, I think I understand what is happening now.

The error condition that I was seeing is reproducible with the
following sequence of events:

1. Index a parent document.
2. Index a child document of the parent from step 1.
3. Delete the parent document.
4. Execute a has_child search. This will return a shard failure and
   the log file shows the null pointer exception.

Well, it makes sense that this would represent an error condition.
The situation comes up in our functional tests, which do a lot of
indexing, re-indexing, and deleting, and it seems that the deletes
are somehow getting interleaved with the updates.

I think what is happening in our app test is that a delete operation
is underway that is deleting the child and then the parent, but an
indexing operation snuck in between and indexed the parent and the
child, like this:

1. On thread 1, the Delete Child operation executes, but before the
   Delete Parent starts...
2. On another thread, the same child gets indexed. Now the child doc
   is back.
3. Back on thread 1, the Delete Parent operation executes. Now
   the child is orphaned, and the has_child search will fail.

I think we can make changes in our app to avoid this. If we delete
the parent first, then the above scenario wouldn't result in the shard
failure, just an orphaned parent (one without its child) instead, which
isn't so bad.
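
To make the interleaving above concrete, here is a minimal sketch that replays the two delete orderings against a toy in-memory stand-in for the index. It does not call Elasticsearch at all; the class `InterleavingSketch`, its methods, and the ids `p1` and `c1` are made up for illustration. It only demonstrates the hypothesis described in this post (which documents are left behind by each ordering), not what actually happens inside the has_child query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy in-memory stand-in for an index (hypothetical; no Elasticsearch calls).
final class InterleavingSketch {
    private final Set<String> parents = new HashSet<>();
    private final Map<String, String> childToParent = new HashMap<>();

    void indexParent(String id) { parents.add(id); }
    void indexChild(String id, String parentId) { childToParent.put(id, parentId); }
    void deleteParent(String id) { parents.remove(id); }
    void deleteChild(String id) { childToParent.remove(id); }

    // Children that point at a parent id that is no longer indexed.
    public List<String> orphanedChildren() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : childToParent.entrySet()) {
            if (!parents.contains(e.getValue())) {
                out.add(e.getKey());
            }
        }
        Collections.sort(out);
        return out;
    }

    // Parents that no indexed child points at.
    public List<String> childlessParents() {
        List<String> out = new ArrayList<>(parents);
        out.removeAll(childToParent.values());
        Collections.sort(out);
        return out;
    }

    // Delete child, then parent, with a re-index slipping in between.
    public static InterleavingSketch childFirstDelete() {
        InterleavingSketch idx = new InterleavingSketch();
        idx.indexParent("p1");
        idx.indexChild("c1", "p1");
        idx.deleteChild("c1");       // thread 1: delete child
        idx.indexParent("p1");       // other thread: re-index parent (already there)
        idx.indexChild("c1", "p1");  // other thread: re-index child, child is back
        idx.deleteParent("p1");      // thread 1: delete parent, child is orphaned
        return idx;
    }

    // Delete parent, then child, with the same re-index slipping in between.
    public static InterleavingSketch parentFirstDelete() {
        InterleavingSketch idx = new InterleavingSketch();
        idx.indexParent("p1");
        idx.indexChild("c1", "p1");
        idx.deleteParent("p1");      // thread 1: delete parent
        idx.indexParent("p1");       // other thread: re-index parent, parent is back
        idx.indexChild("c1", "p1");  // other thread: re-index child
        idx.deleteChild("c1");       // thread 1: delete child, parent has no child
        return idx;
    }

    public static void main(String[] args) {
        System.out.println("child-first delete, orphaned children: "
                + childFirstDelete().orphanedChildren());   // [c1]
        System.out.println("parent-first delete, childless parents: "
                + parentFirstDelete().childlessParents());  // [p1]
    }
}
```

In this toy model the child-first ordering leaves `c1` pointing at a missing parent, while the parent-first ordering leaves no orphaned children, only `p1` without a child.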

Are there other strategies to avoid this? Is there a way to execute a
delete against both the child and the parent in an atomic fashion, or
perhaps a way to have the deletion of the parent automatically delete
the child?

Kind regards,
Lar

kimchy · April 13, 2011, 8:18am

Can you recreate the first scenario? It should not fail in this case; it should just ignore those docs with no parents.
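
As a rough illustration of "just ignore those docs with no parents", the sketch below resolves matching child ids to parent ids and skips any child whose parent cannot be found instead of failing. This is not the Elasticsearch ChildCollector code; `OrphanTolerantCollector`, `collectParents`, and the plain map and set arguments are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of orphan-tolerant parent resolution (not ES internals).
final class OrphanTolerantCollector {

    // Returns the ids of live parents that have at least one matching child.
    // parentOfChild maps child id to parent id; liveParents holds indexed parent ids.
    public static Set<String> collectParents(List<String> matchingChildren,
                                             Map<String, String> parentOfChild,
                                             Set<String> liveParents) {
        Set<String> result = new LinkedHashSet<>();
        for (String child : matchingChildren) {
            String parent = parentOfChild.get(child); // null if the child is unknown
            if (parent == null || !liveParents.contains(parent)) {
                continue; // orphan: skip it rather than throw
            }
            result.add(parent);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> parentOfChild = new HashMap<>();
        parentOfChild.put("c1", "p1");
        parentOfChild.put("c2", "p2"); // p2 has been deleted, so c2 is an orphan
        Set<String> liveParents = new HashSet<>(Arrays.asList("p1"));

        // c3 has no parent entry at all; c2 points at a missing parent.
        System.out.println(collectParents(
                Arrays.asList("c1", "c2", "c3"), parentOfChild, liveParents)); // [p1]
    }
}
```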

Lar_Mader · April 14, 2011, 7:11pm

Ok, I've created a fairly simple java program that reliably reproduces
the problem.

The program is available on gist at: https://gist.github.com/920209

The program needs to load a pdf document. Probably any pdf would
suffice but I've uploaded the one I used for testing this here:

---- begin really long url -----

http://ym6oxg.sn2.livefilestore.com/y1px5Kd8OyeW32v4CjbjM-db6pYUXb4Llr496sP_0Q9wpXTICejS9Huq0Y8byQRn8B_tnruYPXneq_gvDPuNDMA7-0P4KW9wGC6/document1.pdf?download&psid=1
---- end really long url ----

The program essentially just loops on:
1. creating a parent and child doc, where the child is the pdf
2. executing a search
3. deleting the parent and child docs
4. repeat

After looping for a short while a shard failure occurs, and a null
pointer exception is logged in elasticsearch.

Please take a look, as this looks to be a legitimate issue.

Thanks so much,
Lar

kimchy · April 14, 2011, 7:18pm

Thanks for the effort. I'm not sure why the test case needs to index a pdf; can you simplify it so it just indexes some text and that's it? It will be simpler for me to turn it into a test case in ES.

Lar_Mader · April 14, 2011, 8:07pm

Ok, good call. Indexing the pdf wasn't necessary, although it did
make the problem occur a little more quickly.

Here's a version that doesn't need a pdf. You may need to run it more
than once in a row for the failure to occur, since it is intermittent:

https://gist.github.com/lmader/920353

ElasticTest.java (excerpt, truncated):

/**
 * 
 * Also see comments on the function main().
 * 
 * Depends only on the libs that come with elasticsearch:
 *    elasticsearch-0.15.2.jar
 *    lucene*.jar
 *  
 *  Place the ElasticTest.java file and the above jars in the same folder.

Thanks again!
Lar

Lar_Mader · April 14, 2011, 8:49pm

Also, I just edited it to clean up a few minor details. Should be
good to go now.

https://gist.github.com/lmader/920353

Lar

Lar_Mader · April 18, 2011, 4:16pm

Shay,

Have you been able to duplicate the problem at your end with the
program I posted on gist? This seems like a concurrency bug in
elastic, and has us concerned.

Thanks,
Lar

kimchy · April 19, 2011, 9:05am

Not yet, will look at it this week.

kimchy · April 21, 2011, 2:55pm

Heya,

Recreated and pushed a fix: see issue #875, "Search request intermittent failures with has_child query/filter", in elastic/elasticsearch on GitHub.

-shay.banon

Lar_Mader · April 21, 2011, 4:58pm

Awesome, thanks!

Lar
