A quick question regarding the scroll_size parameter of a datafeed in an anomaly detection job.
Does scroll_size just limit the number of results returned per query, while every document is still processed (so a smaller scroll_size implies a larger number of query executions)?
Or, let's say we have 1,000 docs in a time range equal to the bucket span of a job, and scroll_size is set to 750: does the datafeed process only 750 documents and then proceed to the next bucket (leaving the remaining 250 documents in that bucket unprocessed)?
Thank you in advance!
To answer my own question, for those interested:
"scroll_size: In most cases, the type of search that the datafeed executes against Elasticsearch uses the scroll API. Scroll size defines how much the datafeed queries Elasticsearch for at a time. For example, if the datafeed is set to query for log data every 5 minutes, but in a typical 5-minute window there are 1 million events, the idea of scrolling that data means that not all 1 million events will be fetched with one giant query. Rather, it will do it with many queries in increments of
scroll_size. By default, this scroll size is set conservatively to 1,000. So, to get 1 million records returned to ML, the datafeed will ask Elasticsearch for 1,000 rows, a thousand times. Increasing
scroll_size to 10,000 will make the number of scrolls be reduced to a hundred. In general, beefier clusters should be able to handle a larger
scroll_size and thus be more efficient in the overall process."
Source: Machine Learning with the Elastic Stack
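To make the scroll arithmetic from the quote concrete, here is a minimal sketch. The helper function is hypothetical (it is not part of any Elasticsearch API); it just computes how many scroll requests the datafeed would issue to fetch a given number of events:

```python
import math

def scroll_request_count(total_docs: int, scroll_size: int) -> int:
    """Number of scroll requests needed to fetch total_docs
    in increments of scroll_size (the last request may be partial)."""
    return math.ceil(total_docs / scroll_size)

# Default scroll_size of 1,000: 1 million events take 1,000 requests
print(scroll_request_count(1_000_000, 1_000))   # 1000
# Raising scroll_size to 10,000 cuts that to 100 requests
print(scroll_request_count(1_000_000, 10_000))  # 100
```

So in the original question's scenario (1,000 docs, scroll_size of 750), the datafeed would simply issue two scroll requests for that bucket; no documents are skipped. The scroll_size itself is set in the datafeed configuration when creating it via the ML datafeed API.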
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.