SearchContextMissingException during long scroll/scan operations?

Does anyone else have problems with SearchContextMissingExceptions in scroll/scan operations?

I have logstash indices which each contain tens of millions of records. I need to walk over
the entire index, processing data from each record. To do this, I'm using the elasticsearch-py
Python library, and Elasticsearch 1.6.0 on a small (4-node) cluster.

Here's my code:

import elasticsearch
import elasticsearch.exceptions
import elasticsearch.helpers as helpers

es = elasticsearch.Elasticsearch(['http://XXX.XXX.XXX.108:9200'], retry_on_timeout=True)

scanResp = helpers.scan(client=es, scroll="5m", index=index_name, timeout="5m", size=1000)

for resp in scanResp:
    # DO STUFF FOR ONE RECORD

The processing handles several thousand records a second while it's running, so I don't
think I'm hitting the 5-minute limit.

After an indeterminate amount of time - sometimes quickly sometimes not,
I get this stack dump. I've formatted the last part for easier reading,
and redacted part of the IP addresses.

Traceback (most recent call last):
  File "/home/ptrei/util/str2int.py", line 190, in <module>
    mymain()
  File "/home/ptrei/util/str2int.py", line 177, in mymain
    process_index(indexname)
  File "/home/ptrei/util/str2int.py", line 112, in process_index
    for resp in scanResp:
  File "/usr/lib/python2.6/site-packages/elasticsearch-1.4.0-py2.6.egg/elasticsearch/helpers/__init__.py", line 230, in scan
    resp = client.scroll(scroll_id, scroll=scroll)
  File "/usr/lib/python2.6/site-packages/elasticsearch-1.4.0-py2.6.egg/elasticsearch/client/utils.py", line 68, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.6/site-packages/elasticsearch-1.4.0-py2.6.egg/elasticsearch/client/__init__.py", line 616, in scroll
    params=params, body=scroll_id)
  File "/usr/lib/python2.6/site-packages/elasticsearch-1.4.0-py2.6.egg/elasticsearch/transport.py", line 308, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.6/site-packages/elasticsearch-1.4.0-py2.6.egg/elasticsearch/connection/http_urllib3.py", line 86, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/lib/python2.6/site-packages/elasticsearch-1.4.0-py2.6.egg/elasticsearch/connection/base.py", line 102, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, u'{"_scroll_id":"c2NhbjswOzE7dG90YWxfaGl0czozNzkzNTg5ODs=","took":76,"timed_out":false,"_shards"
:{"total":5,"successful":0,"failed":5,"failures":[

{"status":404,"reason":"SearchContextMissingException[No search context found for id [13]]"},
{"status":404,"reason":"RemoteTransportException[[pegasus_101][inet[/XXX.XXX.XXX.101:9300]]
  [indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [15]]; "},

{"status":404,"reason":"SearchContextMissingException[No search context found for id [14]]"},
{"status":404,"reason":"RemoteTransportException[[pegasus_101][inet[/XXX.XXX.XXX.101:9300]]
  [indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [14]]; "},

{"status":404,"reason":"SearchContextMissingException[No search context found for id [15]]"}]

},"hits":{"total":37935898,"max_score":0.0,"hits":[]}}')

My main suspicion is that I'm running this on underpowered hardware (more on the way), but if
anyone has other theories or more insight, I'd love to hear it. Searching shows that similar
problems have been around for a while.

thanks!
Peter

Is your scroll code using the new scroll id which is sent on each scroll response? A common problem here is that people try to use the original scroll id for all scroll requests which would result in errors similar to this.

I'm using elasticsearch-py, a python wrapper for the API. I'm under the impression that this handles the
scroll-id internally - I certainly don't see it myself.
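For anyone wondering what "handled internally" means: here's a rough sketch of what `helpers.scan` does under the hood, assuming the ES 1.x scan/scroll protocol. The key point is that every scroll response carries a fresh `_scroll_id`, and the next request must send that newest id, not the original one. (The function name `scroll_all` and the sizes here are mine, not from the library.)

```python
def scroll_all(es, index, query=None, scroll="5m", size=1000):
    """Yield every hit from `index`, always passing the newest scroll id."""
    # Initial scan request: in ES 1.x, search_type="scan" returns no hits,
    # only a scroll id to start pulling batches with.
    resp = es.search(index=index,
                     body=query or {"query": {"match_all": {}}},
                     scroll=scroll, size=size, search_type="scan")
    scroll_id = resp["_scroll_id"]
    while True:
        resp = es.scroll(scroll_id=scroll_id, scroll=scroll)
        scroll_id = resp["_scroll_id"]  # refresh the id on every round trip
        hits = resp["hits"]["hits"]
        if not hits:  # an empty batch means the scan is exhausted
            break
        for hit in hits:
            yield hit
```

So if the library is doing its job, a stale scroll id shouldn't be the cause here.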

Sometimes, the process runs correctly for many hours before completing, or failing. Failure seems to
vaguely correlate with how heavily the ES cluster is being used, which is why I suspect my hardware
is to blame.

Peter

could it be the scroll id expired?

Jason:
The scroll operation sometimes runs for hours without a problem; at other times it fails within a few minutes. I don't think it's an expiration. I set the timeout to 5 minutes, which is way up from the default 10 seconds. I've tried tinkering with the scroll size: dropping to 50 (from 800-1000) seems to help, but not entirely.
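One workaround sketch, not a fix for the root cause: if the search context vanishes mid-scan, catch the error and restart the whole scan from scratch. This only makes sense if your per-record processing is idempotent (records will be seen again). The wrapper below is my own, generic helper, not part of elasticsearch-py; the exception class it would catch in this thread's case is `elasticsearch.exceptions.NotFoundError`, as shown in the stack trace above.

```python
def scan_with_restart(make_scan, retriable=(Exception,), max_restarts=3):
    """Iterate the generator returned by `make_scan()`; on a retriable
    error, build a fresh generator and start over (records may repeat)."""
    for attempt in range(max_restarts + 1):
        try:
            for record in make_scan():
                yield record
            return  # finished cleanly
        except retriable:
            if attempt == max_restarts:
                raise  # out of retries, surface the error
```

Used against the original code, something like:

    for resp in scan_with_restart(
            lambda: helpers.scan(client=es, scroll="5m", index=index_name, size=50),
            retriable=(elasticsearch.exceptions.NotFoundError,)):
        # DO STUFF FOR ONE RECORD (must tolerate duplicates)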

We've been having the same issue with our cluster. We are doing many scroll queries with the code between queries taking much less time than the scroll timeout. When the query is searching over a large dataset it has been failing with this error after several hundred iterations.

Did you ever find a solution here?


Is there any solution for this? Has anyone tried this on the latest version of ES?
For me there's no reason to use ES if this problem can't be solved.

We're having exactly the same issue. Did anyone find the exact reason, or maybe a workaround?

I'm the OP.

Sorry guys, I never found an ES solution, and wound up using Splunk to digest my raw data.
I have since moved on to other projects.

I'm also experiencing this currently; we're on ES 1.7, using the elasticsearch Python library.

I am facing the same issue as well. Did someone find a solution for this?