Scan with fields and size parameters not returning expected results

Hi,

I am using Elasticsearch v1.3.4 with Spring Data (spring-data-elasticsearch v1.1.2.RELEASE) to access the index. I need to scan the ids of all indexed documents. There are over 10 million indexed documents in 5 shards. I only need the document id, not the indexed document source. I am using the scan and scroll API. There are two issues:

  1. The results are not limited to the specified fields; the entire indexed document is returned.
  2. The result set size does not match the size specified in the scan request, which is the number of results per shard.

Here is my scan request:

GET /my-index/_search?search_type=scan&scroll=1m
{
    "fields": [
        "_id"
    ],
    "query": {
        "match_all": {}
    },
    "size": 20
}

This returns a scroll id:

{
   "_scroll_id": "c2Nhbjs1Ozc2ODA0OTI6SGpobU5kU21RR0d6d1JyeXlRSmdHQTs3OTk5NTkzOmJpRE9zTi1HUWZpQ3NPSUxVZkR2Ymc7NzY4MDQ5MzpIamhtTmRTbVFHR3p3UnJ5eVFKZ0dBOzc5MDQ5OTQ6cmhFMFJlU2VUSE9NaEJvRVJIbWJSQTs3NjgwNDkxOkhqaG1OZFNtUUdHendScnl5UUpnR0E7MTt0b3RhbF9oaXRzOjQ5Mjg2MTQ7",
   "took": 13,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 4928614,
      "max_score": 0,
      "hits": []
   }
}

which is used in the scroll request:

GET /my-index/_search?scroll=1m&scroll_id=c2Nhbjs1Ozc2ODA0OTI6SGpobU5kU21RR0d6d1JyeXlRSmdHQTs3OTk5NTkzOmJpRE9zTi1HUWZpQ3NPSUxVZkR2Ymc7NzY4MDQ5MzpIamhtTmRTbVFHR3p3UnJ5eVFKZ0dBOzc5MDQ5OTQ6cmhFMFJlU2VUSE9NaEJvRVJIbWJSQTs3NjgwNDkxOkhqaG1OZFNtUUdHendScnl5UUpnR0E7MTt0b3RhbF9oaXRzOjQ5Mjg2MTQ7

However, the result is neither what I expected nor what the documentation describes (https://www.elastic.co/guide/en/elasticsearch/reference/1.3/search-request-scroll.html).

The first question is: why does Elasticsearch not honour the 'size' specified in the scan request? The size of results per shard is specified as 20, so with 5 shards the expected number of hits in each scroll result is 100. The actual number of hits in the result is always 10. Is there a system-wide scan configuration that overrides the 'size' parameter?
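The expected figure above is simple arithmetic: in scan mode the 'size' applies per shard, so the upper bound on hits per scroll page is size multiplied by the number of shards. A minimal Java sketch of that calculation follows; the class and method names are hypothetical and exist only for illustration.

```java
// Illustrative only: names are hypothetical, not part of Elasticsearch or Spring Data.
final class ScanPageSizeSketch {

    // In scan mode 'size' is applied per shard, so one scroll page can hold
    // at most sizePerShard * shardCount hits.
    static long maxHitsPerPage(int sizePerShard, int shardCount) {
        if (sizePerShard < 0 || shardCount < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        return (long) sizePerShard * shardCount;
    }

    public static void main(String[] args) {
        // Values from the request above: size 20, 5 shards.
        System.out.println(maxHitsPerPage(20, 5));
    }
}
```

This is an upper bound only: a page can be smaller once some shards have no more documents to return.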

The second question is: why does Elasticsearch not honour the 'fields' specified in the scan request? The only field specified in the scan request is the document id, '_id'. The expected result should include only the '_id' in each hit. However, the actual result contains the full document entity in each hit as well.

Eventually I would like to convert these two requests into Spring Data calls using the following two methods:

public String scan(SearchQuery searchQuery, long scrollTimeInMillis, boolean noFields)

and

public <T> Page<T> scroll(String scrollId, long scrollTimeInMillis, SearchResultMapper mapper)

However, these unexpected results are preventing me from moving forward.
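Regarding the scroll request above, one thing that may be worth checking: the linked 1.3 scroll documentation sends the scroll id to the `/_search/scroll` endpoint rather than to an index's `/_search` endpoint. Below is a minimal Java sketch of building such a URL. The class name, method name, host, and placeholder scroll id are all assumptions made for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative only: a hypothetical helper, not part of any Elasticsearch client.
final class ScrollUrlSketch {

    // Builds a scroll URL against /_search/scroll (no index name in the path).
    // The scroll id is URL-encoded because base64 ids can contain '=', '+' or '/'.
    static String scrollUrl(String host, String scrollId, String keepAlive) {
        return host + "/_search/scroll?scroll=" + keepAlive
                + "&scroll_id=" + URLEncoder.encode(scrollId, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "SCROLL_ID_FROM_SCAN" is a placeholder for the id returned by the scan request.
        System.out.println(scrollUrl("http://localhost:9200", "SCROLL_ID_FROM_SCAN", "1m"));
    }
}
```

If the id is sent to `/my-index/_search` instead, Elasticsearch may treat the call as an ordinary search with the default size of 10 and the full document source, which would be consistent with both symptoms described above. I have not verified this against this cluster.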

I was going to post a similar question, but I'll "add on" to yours... Regarding the scan/scroll size parameter, using Elasticsearch 1.5.2, I observed similar, although somewhat less predictable, behavior when scrolling through a large result set.

I had about 10 indices that yielded results, with 2 shards each. There was a total of about 3M hits, a large percentage of which were from one index. Specifying a size of 1000 yielded several pages with about 10k hits each, and the number of hits per page decreased over time.

I'd like to have better control over the hits per page in scan&scroll, as I do some pre-processing on the hits and often run out of memory when the pages are too large.
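One possible workaround, sketched here under the assumption that 'size' applies per shard in scan mode (as the first post describes): derive the per-shard size from a target page size, and additionally process each page in fixed-size chunks so memory use stays bounded even when a page comes back larger than hoped. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helpers, not part of Elasticsearch or Spring Data.
final class ScrollPageControlSketch {

    // Picks a per-shard 'size' so that size * totalShards stays at or below
    // the target hits per page (never below 1).
    static int perShardSizeFor(int targetHitsPerPage, int totalShards) {
        if (targetHitsPerPage < 1 || totalShards < 1) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        return Math.max(1, targetHitsPerPage / totalShards);
    }

    // Splits one page of hits into consecutive chunks of at most chunkSize items.
    static <T> List<List<T>> chunk(List<T> hits, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < hits.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, hits.size());
            chunks.add(new ArrayList<>(hits.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // About 10 indices with 2 shards each gives 20 shards in total.
        System.out.println(perShardSizeFor(1000, 20));
    }
}
```

With roughly 20 shards, a per-shard size of 1000 allows up to 20,000 hits per page, so a smaller per-shard value is needed to get pages of around 1000 hits. Pages can still shrink over time as individual shards run out of documents.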