Hi,
I am using Elasticsearch v1.3.4 with Spring Data (spring-data-elasticsearch v1.1.2.RELEASE) to access the index. I need to scan the ids of all indexed documents. There are over 10 million indexed documents across 5 shards. I only need each document's id, not its source. I am using the scan and scroll API and have run into two issues:
- The results are not limited to the specified fields; the entire indexed document is returned.
- The number of hits returned does not match the size specified in the scan request, where size is the number of results per shard.
Here is my scan request:
GET /my-index/_search?search_type=scan&scroll=1m
{
  "fields": [
    "_id"
  ],
  "query": {
    "match_all": {}
  },
  "size": 20
}
This returns a scroll id:
{
  "_scroll_id": "c2Nhbjs1Ozc2ODA0OTI6SGpobU5kU21RR0d6d1JyeXlRSmdHQTs3OTk5NTkzOmJpRE9zTi1HUWZpQ3NPSUxVZkR2Ymc7NzY4MDQ5MzpIamhtTmRTbVFHR3p3UnJ5eVFKZ0dBOzc5MDQ5OTQ6cmhFMFJlU2VUSE9NaEJvRVJIbWJSQTs3NjgwNDkxOkhqaG1OZFNtUUdHendScnl5UUpnR0E7MTt0b3RhbF9oaXRzOjQ5Mjg2MTQ7",
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4928614,
    "max_score": 0,
    "hits": []
  }
}
which is used in the scroll request:
GET /my-index/_search?scroll=1m&scroll_id=c2Nhbjs1Ozc2ODA0OTI6SGpobU5kU21RR0d6d1JyeXlRSmdHQTs3OTk5NTkzOmJpRE9zTi1HUWZpQ3NPSUxVZkR2Ymc7NzY4MDQ5MzpIamhtTmRTbVFHR3p3UnJ5eVFKZ0dBOzc5MDQ5OTQ6cmhFMFJlU2VUSE9NaEJvRVJIbWJSQTs3NjgwNDkxOkhqaG1OZFNtUUdHendScnl5UUpnR0E7MTt0b3RhbF9oaXRzOjQ5Mjg2MTQ7
However, the result is not as expected and not as described in the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/1.3/search-request-scroll.html):
The first question is: why does Elasticsearch not honour the 'size' specified in the scan request? The size is specified as '20' results per shard, so with 5 shards I expect 100 hits in each scroll response. The actual number of hits in each response is always '10'. Is there a system-wide scan configuration that is overriding the 'size' parameter?
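To make the expectation concrete, here is a small Java sketch of the arithmetic. The class and method names are my own and purely illustrative. It assumes, as I understand the documentation, that 'size' applies per shard in scan mode, and it treats every scroll response as full, so the call count is only a lower bound.

```java
// Illustrative helper (hypothetical names), not part of Elasticsearch or Spring Data.
class ScanPageMath {

    // In scan mode 'size' is per shard, so one scroll response
    // can hold up to sizePerShard * shards hits.
    static long maxHitsPerScroll(int sizePerShard, int shards) {
        return (long) sizePerShard * shards;
    }

    // Lower bound on the number of scroll calls needed to read all hits,
    // assuming every response is full (ceiling division).
    static long minScrollCalls(long totalHits, long hitsPerScroll) {
        return (totalHits + hitsPerScroll - 1) / hitsPerScroll;
    }

    public static void main(String[] args) {
        long perScroll = maxHitsPerScroll(20, 5);          // size=20, 5 shards
        long calls = minScrollCalls(4928614L, perScroll);  // hits.total from the scan response
        System.out.println("max hits per scroll response: " + perScroll);
        System.out.println("scroll calls needed (at least): " + calls);
    }
}
```

With the values from my request this gives 100 hits per response, against the 10 I actually receive.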
The second question is: why does Elasticsearch not honour the 'fields' specified in the scan request? The only field specified in the scan request is the document id, "_id". I expect each hit to include only the "_id". However, each hit in the actual result also contains the full document source.
Eventually I would like to convert these two requests into Spring Data calls using the following two methods:
public String scan(SearchQuery searchQuery, long scrollTimeInMillis, boolean noFields)
and
public <T> Page<T> scroll(String scrollId, long scrollTimeInMillis, SearchResultMapper mapper)
However, these unexpected results are preventing me from moving forward.