Empty Slices with Scan/Scroll


We're currently using Spark with es-hadoop to read 1 million documents from an Elasticsearch index. The sliced scan-and-scroll that es-hadoop uses internally is not distributing results evenly across the slices; in fact, for each shard-preferenced scroll, all but one slice is empty.

  • Elasticsearch 6.0.0
  • Single node cluster for testing
  • 5 Shards
  • 1 million documents in an index
  • es.input.max.docs.per.partition set to 50k
  1. Match All query is run to obtain all documents from the index; this results in 5 scan scrolls with 4 slices each.
  2. Of the 4 slices in each scan scroll, only 1 contains any results (~200k documents).
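For reference, the intended behaviour is that each document lands in a slice by hashing the slice field modulo the number of slices, which should spread documents roughly evenly. The toy sketch below illustrates that principle only; it uses Python's built-in `hash` and made-up document ids, not Elasticsearch's actual hash or slicing implementation:

```python
from collections import Counter

def slice_for(doc_id: str, max_slices: int) -> int:
    # Illustration only: assign a document to a slice by hashing its id
    # modulo the slice count (this is NOT Elasticsearch's actual hash).
    return hash(doc_id) % max_slices

# Simulate ~200k documents spread across 4 slices.
counts = Counter(slice_for(f"doc-{i}", 4) for i in range(200_000))
# A well-distributed hash puts roughly 50k documents in each slice.
# The behaviour reported above (one slice with ~200k, three empty)
# shows the documents were not being split this way.
```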

Is there any way to evenly distribute the results across the slices? I believe this may be the same issue as the one described in https://github.com/elastic/elasticsearch/issues/27550.



It's possible that you are running into that linked issue. If that is the case, there's not much that we can do in terms of balancing the sliced scrolls. You could set the es.input.max.docs.per.partition setting to a very large value such as Integer.MAX_VALUE, which should effectively disable the slicing feature. Note that this only eliminates the overhead of the empty tasks created for the empty slices; it does not rebalance the data itself.
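If you go that route, the configuration might look something like this in a PySpark job. This is a sketch, not a tested fix: the node address and index name are placeholders, though `es.nodes`, `es.resource`, and `es.input.max.docs.per.partition` are real es-hadoop settings:

```python
# Raise es.input.max.docs.per.partition so high that es-hadoop never
# splits a shard into slices: each shard maps to one Spark partition.
MAX_INT = 2**31 - 1  # Java Integer.MAX_VALUE

es_read_conf = {
    "es.nodes": "localhost:9200",   # placeholder node address
    "es.resource": "my_index",      # placeholder index name
    # With the per-partition cap effectively unbounded, the 50k limit
    # is never hit, so no sliced scrolls (and no empty tasks) are created.
    "es.input.max.docs.per.partition": str(MAX_INT),
}

# Typical usage (requires a running SparkSession and cluster):
# df = (spark.read
#       .format("org.elasticsearch.spark.sql")
#       .options(**es_read_conf)
#       .load())
```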