Ok, I think I understand what is happening now.
The error condition that I was seeing is reproducible with the
following sequence of events:
- Index a parent document
- Index a child document of the parent from step 1.
- Delete the parent document
- Execute a has_child search. This will return a shard failure and
the log file shows the null pointer exception.
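The sequence above could be scripted roughly as follows. This is an untested sketch, not a confirmed recreation: the index name `acme`, the types `content` and `contentFiles`, and the field names are taken from the log output further down, while the node address `localhost:9200`, the document ids, and the choice of a `has_child` query over a term query are assumptions. It only prints the curl commands and does not talk to a node.

```python
import json

BASE = "http://localhost:9200"  # assumption: a local node on the default port
INDEX = "acme"                  # index name as it appears in the log
PARENT_TYPE = "content"         # parent type, from child_filter[contentFiles/content]
CHILD_TYPE = "contentFiles"     # child type, from the same log line


def repro_requests(parent_id="1", child_id="10"):
    """Return the (method, url, body) steps of the sequence described above.

    The ids are hypothetical; any pair of ids should do.
    """
    return [
        # Create the index and declare the parent type on the child mapping.
        ("PUT", f"{BASE}/{INDEX}", None),
        ("PUT", f"{BASE}/{INDEX}/{CHILD_TYPE}/_mapping",
         {CHILD_TYPE: {"_parent": {"type": PARENT_TYPE}}}),
        # Step 1: index a parent document.
        ("PUT", f"{BASE}/{INDEX}/{PARENT_TYPE}/{parent_id}",
         {"title": "someTitle"}),
        # Step 2: index a child of that parent.
        ("PUT", f"{BASE}/{INDEX}/{CHILD_TYPE}/{child_id}?parent={parent_id}",
         {"file": "mission statement"}),
        # Step 3: delete the parent only, leaving the child behind.
        ("DELETE", f"{BASE}/{INDEX}/{PARENT_TYPE}/{parent_id}", None),
        # Make the changes visible to search.
        ("POST", f"{BASE}/{INDEX}/_refresh", None),
        # Step 4: run a has_child search against the parent type.
        ("POST", f"{BASE}/{INDEX}/{PARENT_TYPE}/_search",
         {"query": {"has_child": {"type": CHILD_TYPE,
                                  "query": {"term": {"file": "mission"}}}}}),
    ]


def as_curl(method, url, body):
    """Render one step as a curl command line."""
    cmd = f"curl -X{method} '{url}'"
    if body is not None:
        cmd += f" -d '{json.dumps(body)}'"
    return cmd


if __name__ == "__main__":
    for step in repro_requests():
        print(as_curl(*step))
```

Running the script prints one curl command per step, which can be pasted into a shell against a scratch node.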
It makes sense that this would be an error condition.
The situation comes up in our functional tests, which do a lot of
indexing, re-indexing, and deleting, and it seems that the deletes
are getting interleaved with the updates.
I think what is happening in our app test is that a delete operation
is underway that is deleting the child and then the parent, but an
indexing operation snuck in between and indexed the parent and the
child, like this:
- On thread 1 - Delete Child document executes, but before the Delete
Parent starts...
- On another thread the same child gets indexed. Now the child doc
is back.
- Now, back on thread 1 - the Delete Parent operation executes. Now
the child is orphaned, and the has_child search will fail.
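The interleaving above can be replayed deterministically with a small in-memory model. This is only an illustration of the ordering problem: `FakeIndex` and the ids `p1` and `c1` are hypothetical and nothing here talks to Elasticsearch.

```python
class FakeIndex:
    """Tiny in-memory stand-in for the index (not Elasticsearch)."""

    def __init__(self):
        self.parents = set()
        self.children = {}  # child id -> parent id

    def index_parent(self, pid):
        self.parents.add(pid)

    def index_child(self, cid, pid):
        self.children[cid] = pid

    def delete_parent(self, pid):
        self.parents.discard(pid)

    def delete_child(self, cid):
        self.children.pop(cid, None)

    def orphaned_children(self):
        """Children whose parent document no longer exists."""
        return sorted(c for c, p in self.children.items()
                      if p not in self.parents)


def run_interleaving():
    idx = FakeIndex()
    idx.index_parent("p1")
    idx.index_child("c1", "p1")
    idx.delete_child("c1")       # thread 1: Delete Child
    idx.index_child("c1", "p1")  # other thread: the same child is indexed again
    idx.delete_parent("p1")      # thread 1: Delete Parent
    return idx


if __name__ == "__main__":
    print(run_interleaving().orphaned_children())  # ['c1']
```

The final state has no parent but still has child `c1`, which is the orphaned-child state that the has_child search trips over.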
I think we can make changes in our app to avoid this. If we delete
the parent first, then the above scenario wouldn't result in the shard
failure. It would just leave a parent with no child, which isn't so bad.
Are there other strategies to avoid this? Is there a way to execute a
delete against both the child and the parent in an atomic fashion, or
perhaps a way to have the deletion of the parent automatically delete
the child?
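One app-side option, which does not rely on any Elasticsearch feature, would be to serialize all operations on a parent and its children behind a per-parent lock, so that an index request cannot land between the two deletes. A minimal sketch, assuming a single application process; `FamilyLocks`, `delete_family`, `index_child`, and the `delete_doc` / `index_doc` callables are hypothetical names standing in for whatever the app uses to talk to the index:

```python
import threading
from collections import defaultdict


class FamilyLocks:
    """One lock per parent id, shared by everything touching that family."""

    def __init__(self):
        self._guard = threading.Lock()            # protects the lock table
        self._locks = defaultdict(threading.Lock)

    def for_parent(self, parent_id):
        with self._guard:
            return self._locks[parent_id]


locks = FamilyLocks()


def delete_family(parent_id, child_ids, delete_doc):
    """Delete the children and then the parent as one critical section."""
    with locks.for_parent(parent_id):
        for child_id in child_ids:
            delete_doc("child", child_id)
        delete_doc("parent", parent_id)


def index_child(parent_id, child_id, index_doc):
    """Index a child; waits if its family is currently being deleted."""
    with locks.for_parent(parent_id):
        index_doc("child", child_id, parent_id)
```

This only closes the window between the two deletes. An index request that arrives after the family has been deleted would still create an orphaned child unless the app also checks that the parent exists, and the lock does nothing across multiple processes or machines.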
Kind regards,
Lar
On Apr 12, 2:34 pm, Shay Banon shay.ba...@elasticsearch.com wrote:
The simplest way would be something like a curl recreation that first indexes some data and then runs the query (possibly multiple times) to trigger the error. This will allow me to convert it into an integration test and fix it, and also make sure it will never happen again :).
If you can try and work on the above, it would be very helpful. If it's a no go, then ping me privately on IRC where I can download the data files and I can try and recreate it that way.
-shay.banon
On Tuesday, April 12, 2011 at 9:17 PM, lmader wrote:
Thanks for the speedy reply!
What would you need to run this? I can send you the elasticsearch
data files when it is in this state, along with the query (and any
other supporting files). Let me know what you need and where to send
it.
Thanks so much!
Lar
On Apr 12, 2:25 am, Shay Banon shay.ba...@elasticsearch.com wrote:
Heya,
I can see where it happens, but I'm not sure why. Can you provide a recreation (even if it only fails intermittently) that I can run?
-shay.banon
On Tuesday, April 12, 2011 at 2:54 AM, lmader wrote:
More info:
I checked the elasticsearch log and see the following:
[2011-04-11 16:51:09,892][DEBUG][action.search.type ] [Ardina] [acme][2], node[8ZUq44E3QUu8XPbPvb9yfA], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@4dae0a]
org.elasticsearch.search.query.QueryPhaseExecutionException: [acme][2]: query[ConstantScore(child_filter[contentFiles/content](filtered(file:mission file:statement)->FilterCacheFilterWrapper(_type:contentFiles)))],from[0],size[10]: Query Failed [Failed to execute child query [filtered(file:mission file:statement)->FilterCacheFilterWrapper(_type:contentFiles)]]
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:169)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:132)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:76)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:192)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.access$000(TransportSearchTypeAction.java:75)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.run(TransportSearchTypeAction.java:151)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
    at org.elasticsearch.index.query.type.child.ChildCollector.collect(ChildCollector.java:75)
    at org.apache.lucene.search.Scorer.score(Scorer.java:62)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:212)
    at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:159)
    at org.apache.lucene.search.Searcher.search(Searcher.java:67)
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:143)
    ... 9 more
On Apr 11, 4:23 pm, lmader lmaderintre...@gmail.com wrote:
I'm having a problem with intermittent shard failures (doesn't always
happen) when executing "has_child" type queries against parent/child
documents.
This comes up when I am running a set of functional tests against my
application. The tests create some documents, index them, update the
documents, re-index them, search on them, delete documents, etc.
Most of the time the tests run successfully, but occasionally the
search fails. When it fails, the search doesn't throw an exception,
but it doesn't find the document, and the response contains shard
failures.
However once it gets in the state where the search returns shard
failures, it consistently fails the search. It is only intermittent
in the sense that most of the time the tests succeed. Here are the
shard failures:
[shard [[8ZUq44E3QUu8XPbPvb9yfA][acme][2]], reason
[RemoteTransportException[[Ardina][inet[/10.10.30.52:9300]][search/phase/query]];
nested: QueryPhaseExecutionException[[acme][2]:
query[title:someTitle
ConstantScore(child_filter[contentFiles/content](filtered(file:test)->FilterCacheFilterWrapper(_type:contentFiles)))],from[0],size[10]:
Query Failed [Failed to execute child query
[filtered(file:test)->FilterCacheFilterWrapper(_type:contentFiles)]]];
nested: ]]
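Because the search does not throw in this state, a functional test has to look at the shard section of the response to notice the problem. A minimal sketch, assuming the response has already been parsed from JSON into a dict with a `_shards` section carrying a `failed` count and, when present, a `failures` list; the helper names are hypothetical:

```python
def shard_failures(response):
    """Return the shard failures reported in a parsed search response."""
    shards = response.get("_shards", {})
    failures = list(shards.get("failures", []))
    if shards.get("failed", 0) and not failures:
        # A failure count was reported without any details attached.
        failures = [{"reason": "unknown"}]
    return failures


def assert_no_shard_failures(response):
    """Fail loudly instead of silently returning partial results."""
    failures = shard_failures(response)
    if failures:
        raise AssertionError(f"search had shard failures: {failures}")
```

Calling `assert_no_shard_failures(response)` right after each search in the tests would turn the silent "document not found" into an explicit failure that carries the shard's reason.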
Any idea what could be causing this?
Is there anything to be careful of when re-indexing the parent and
child documents separately? Routing?
Thanks,
Lar