I get this message from the Python driver. If I search for it I find various detailed discussions, but they all seem to assume I'm starting from knowing a lot more than I do.
So, starting from nothing at all, what does this message mean? And then how do I fix it?
So, when we do a search in ES we do it in two roundtrips. To make sure it's consistent, we register a search context on every shard during the first roundtrip. The second roundtrip passes the context ID on, to make sure we operate on the same point-in-time snapshot as the first roundtrip. Now, if something closes the search context, i.e. if it times out (5 min by default), you will see this message. That can happen, for instance, in situations like scan/scroll: if the user doesn't come back in time, we might clean up the context. Do you use scan/scroll?
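For illustration, here is a minimal sketch of those two roundtrips using the low-level elasticsearch-py client directly (the index name and the match_all query are placeholders, not from the thread):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# First roundtrip: registers a search context on every shard and returns a
# scroll_id that points at that point-in-time snapshot.
resp = es.search(index="filebeat-*", scroll="5m", size=5000,
                 body={"query": {"match_all": {}}})
scroll_id = resp["_scroll_id"]

# Subsequent roundtrips pass the context ID back; each call also restarts the
# 5m timeout. If the context has already been cleaned up, this is the call
# that fails with the "No search context found" error.
while resp["hits"]["hits"]:
    resp = es.scroll(scroll_id=scroll_id, scroll="5m")
    scroll_id = resp["_scroll_id"]
```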
```python
for doc in helpers.scan(
    es,
    index="filebeat-*",
    doc_type="etp",
    query=scanQuery.format(args.since, "classname.keyword"),
    size=5000,
    scroll="5m",
    raise_on_error=False,  # Don't know why we sometimes get ScanError otherwise
    preserve_order=True,
):
```
Is the "default five minutes" you mentioned what the "scroll" parameter is about? Is this five minutes for a single iteration of the loop processing a single document, or for a batch of 5000 (I'm guessing here that the "size" parameter divides the results up into batches of 5000 but I don't see why that should be my business), or for the entirety of the loop processing all the query results?
The 5m is the time to process one batch of 5k results in your case. That is one single roundtrip to ES before you have to fetch the next batch. ES is not stream-based, even though the Python API might imply that.
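In other words, each batch of `size` hits must be consumed within the `scroll` window, and the clock restarts whenever the helper fetches the next batch. A hedged sketch of how the two parameters trade off (the `handle` function is hypothetical):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

for doc in helpers.scan(
    es,
    index="filebeat-*",
    size=500,      # smaller batches mean less work per scroll window
    scroll="10m",  # a longer window leaves more time to process each batch
):
    handle(doc)  # hypothetical per-document processing
```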
Thanks. I think that's now enough information for me to experiment with various timings, batch sizes, error detection and retries. (Which, as ever, will no doubt turn out to be far more work than just getting the original logic right.)
I see this error when the search context has timed out so it's no longer present in the cluster. The helpers.scan function is a generator which hides the underlying multiple trips to elasticsearch. If you are not processing the results fast enough, @TimWard, it can lead to the context being cleaned up in elasticsearch which will make the next request for that scroll_id (which is done internally in the helper) produce this error.
To verify that this is the case you can turn on logging at the beginning of your script to see the individual requests being sent to elasticsearch:
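The original snippet isn't preserved in this thread; a minimal sketch of such a setup, assuming the standard elasticsearch-py loggers, might look like:

```python
import logging

# Show the client's per-request logs: elasticsearch-py logs each request on
# the "elasticsearch" logger (with a more verbose "elasticsearch.trace"
# logger available as well).
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("elasticsearch").setLevel(logging.DEBUG)
```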
@honzakral I wonder if we should allow scans to renew their context without consuming results; that way we could make this API easier to use. Your client could, as long as it's making progress, go back and tell ES to keep things open? Just an idea...
Hmm, that would definitely be a nice solution, but I am a bit worried about side effects that might not be immediately obvious to users - people who never finish iterating over the generator would then keep a context alive in elasticsearch indefinitely, which would be bad.
Ultimately I think the current approach is good in that it promotes the idea that scan/scroll is meant for a quick export of data - not for keeping a "cursor" open while you perform an expensive, long-running operation on every document. If that is your situation, I feel you should use some form of background processing with a queue and a pool of workers anyway (see the sketch below).
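A hedged sketch of that pattern, where the scan loop only drains batches onto a queue and a pool of threads does the expensive per-document work (`process_doc` is hypothetical):

```python
import queue
import threading

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
work = queue.Queue(maxsize=10000)  # bounded, so the scan can't outrun memory

def worker():
    while True:
        doc = work.get()
        if doc is None:       # sentinel: no more work
            work.task_done()
            return
        process_doc(doc)      # hypothetical expensive operation
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()

# The scan loop itself now only enqueues, so each batch is consumed well
# within the scroll window.
for doc in helpers.scan(es, index="filebeat-*", scroll="5m"):
    work.put(doc)

for _ in threads:
    work.put(None)  # one sentinel per worker
for t in threads:
    t.join()
```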
To improve the user experience, would there be a way to keep a tombstone of an expired search context, to give the user more accurate info? "Your scroll timed out, try increasing your scroll parameter" would be so much more helpful in this case, if the overhead is not too big.
Another option would be a streaming API to/from elasticsearch, where this would be handled on the coordinating node (potentially the same for bulk), but that sounds to me like more trouble than it's worth...