A quick question regarding the scroll_size parameter of a datafeed for an anomaly detection job.
Does scroll_size only limit the number of results returned per query, with every document still being processed (so a smaller scroll_size simply means more query executions)?
Or, say we have 1,000 documents in a time range equal to the job's bucket span and scroll_size is set to 750: does the datafeed process only 750 documents and then move on to the next bucket, leaving the remaining 250 documents in that bucket unprocessed?
" scroll_size: In most cases, the type of search that the datafeed executes to Elasticsearch uses the scroll API. Scroll size defines how much the datafeed queries to Elasticsearch at a time. For example, if the datafeed is set to query for log data every 5 minutes, but in a typical 5-minute window there are 1 million events, the idea of scrolling that data means that not all 1 million events will be expected to be fetched with one giant query. Rather, it will do it with many queries in increments of scroll_size. By default, this scroll size is set conservatively to 1,000. So, to get 1 million records returned to ML, the datafeed will ask Elasticsearch for 1,000 rows, a thousand times. Increasing scroll_size to 10,000 will make the number of scrolls be reduced to a hundred. In general, beefier clusters should be able to handle a larger scroll_size and thus be more efficient in the overall process."