Varying numbers of results from scan and scroll


#1

I posted this on GitHub already (https://github.com/elastic/elasticsearch/issues/16555), but I found issues there pointing to this forum as the better place to ask. Here's what I posted:

Right off the bat, here's a little info:

$ uname -a
Linux jj-big-box 3.19.0-49-generic #55~14.04.1-Ubuntu SMP Fri Jan 22 11:24:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ curl -XGET 'localhost:9200'
{
  "status" : 200,
  "name" : "Bast",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.4",
    "build_hash" : "0d3159b9fc8bc8e367c5c40c09c2a57c0032b32e",
    "build_timestamp" : "2015-12-15T11:25:18Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

I've seen some similar posts, but I've had trouble squaring their results with mine. I've noticed that I do not receive a consistent number of documents when running scan and scroll in Elasticsearch. Here is Python code exhibiting the behavior (hopefully the use of raw sockets is not too confusing; at first I was trying to make sure the problem had nothing to do with elasticsearch-py, which is why I went the route of raw code):

import socket
import json
import re

HOST = 'localhost'
PORT = 9200

CRLF = "\r\n\r\n"

init_msg = """
GET /index/document/_search?search_type=scan&scroll=15m&timeout=30&size=10 HTTP/1.1
Host: localhost:9200
Accept-Encoding: identity
Content-Length: 94
connection: keep-alive

{"query": {"regexp": {"date_publ": "2001.*"}}, "_source": ["doc_id", "date_publ", "abstract"]}
"""

scroll_msg = """
GET /_search/scroll?scroll=15m HTTP/1.1
Host: localhost:9200
Accept-Encoding: identity
Content-Length: {sid_length}
connection: keep-alive

{sid}
"""

def get_stream(host, port, verbose=True):
    # Set up the socket.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.connect((host, port))
    s.send(init_msg)

    # Fetch scroll_id and total number of hits.
    data = s.recv(4096)
    payload = json.loads(data.split(CRLF)[-1])
    sid = payload['_scroll_id']
    total_hits = payload['hits']['total']

    if verbose:
        print "Total hits: {}".format(total_hits)

    # Iterate through results.
    while True:
        # Send data request.
        msg = scroll_msg.format(sid=sid, sid_length=len(sid))
        s.send(msg)

        # Fetch the response body.
        data = s.recv(1024)
        # Split only at the first blank line so the body is never split again.
        header, body = data.split(CRLF, 1)
        content_length = int(re.findall(r'Content-Length: (\d+)', header)[0])
        while len(body) < content_length:
            body += s.recv(1024)

        # Extract results from response body.
        payload = json.loads(body)
        sid = payload['_scroll_id']
        hits = payload['hits']['hits']

        #print payload['_shards']

        if not hits:
            break

        for hit in hits:
            yield hit


for count, _ in enumerate(get_stream(HOST, PORT), 1): pass

print count
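
One thing the script above takes on trust is that a single recv() call returns the complete header block (and, for the initial request, the complete body). Here is a minimal sketch of a stricter reader, to rule the socket handling out as a cause. It assumes every response carries a Content-Length header (no chunked encoding) and that requests are not pipelined; read_http_response and HEADER_END are hypothetical names, not part of the original script.

```python
import socket

HEADER_END = b"\r\n\r\n"  # blank line that terminates the HTTP headers


def read_http_response(sock, bufsize=1024):
    """Read one HTTP response from sock and return (header, body) as bytes."""
    buf = b""
    # Keep reading until the whole header block has arrived.
    while HEADER_END not in buf:
        chunk = sock.recv(bufsize)
        if not chunk:
            raise IOError("connection closed before headers were complete")
        buf += chunk
    header, body = buf.split(HEADER_END, 1)

    # Assumption: the server always sends Content-Length.
    content_length = None
    for line in header.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            content_length = int(value.strip())
    if content_length is None:
        raise ValueError("no Content-Length header in response")

    # Keep reading until the advertised number of body bytes has arrived.
    while len(body) < content_length:
        chunk = sock.recv(bufsize)
        if not chunk:
            raise IOError("connection closed before body was complete")
        body += chunk
    return header, body[:content_length]
```

In the scroll loop this would stand in for the recv/split/while sequence, with json.loads applied to the returned body. If the counts still vary with this reader in place, the socket code is probably not the culprit.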

When I run the script a few times, I get the following:

$ python new_test.py 
Total hits: 56366
11650
$ python new_test.py 
Total hits: 56366
24550
$ python new_test.py 
Total hits: 56366
8550

Now if I un-comment the line #print payload['_shards'], the shard counts ended up being the following during one run:

Total hits: 56366
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}

...

{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
{u'successful': 4, u'failed': 0, u'total': 4}
28110

and ended up as the following the next run:

Total hits: 56366
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 5, u'failed': 0, u'total': 5}

...

{u'successful': 5, u'failed': 0, u'total': 5}
{u'successful': 3, u'failed': 0, u'total': 3}
{u'successful': 1, u'failed': 0, u'total': 1}
{u'successful': 0, u'failed': 0, u'total': 0}
56366

Note: The last run apparently returned all documents. This is the first time I've seen this during this experimentation.
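
The _shards counts never report a failure in either run (failed stays at 0), so they may not tell the whole story. A minimal sketch of a broader per-page check could look like the following. It assumes each response body also carries the standard top-level timed_out flag next to _shards; page_warnings is a hypothetical helper name, and nothing here has been run against the cluster above.

```python
def page_warnings(payload):
    """Return a list of warnings for one parsed search/scroll response."""
    warnings = []

    # A true timed_out flag means the page may hold only partial results.
    if payload.get("timed_out"):
        warnings.append("timed_out is true: results may be partial")

    shards = payload.get("_shards", {})
    failed = shards.get("failed", 0)
    if failed:
        warnings.append("%d shard(s) failed" % failed)

    # Fewer successful shards than total also points at missing data.
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)
    if successful < total:
        warnings.append("only %d of %d shards succeeded" % (successful, total))

    return warnings
```

Calling page_warnings(payload) right after each json.loads in the loop, and printing anything it returns, would show whether the short runs coincide with timed-out pages. A shrinking total on its own is not flagged, since the run that returned all 56366 documents also counted down from 5 to 0.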

Does anyone have any idea what's going on here? As far as I can tell, I never run into these issues when not doing the regular expression as part of the search, but other than that I'm at a loss.

Thanks for any help!

