We are using the Elastic Web Crawler. Is there a way to limit or throttle the requests the crawler makes against a web data source (a domain)? For example, max 1 request per second, or wait x milliseconds between requests...
Unfortunately, the crawler doesn't currently support the robots.txt crawl-delay directive, which would be the common way to do what you're looking for. If you have a support relationship with Elastic, you can request that feature as an Enhancement Request.
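For reference, the (currently unsupported) directive looks like this in a robots.txt file; crawlers that honor it conventionally interpret the value as seconds between requests:

```
User-agent: *
Crawl-delay: 1
```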
Other than that, you can reduce the parallelism with which the crawler operates to lower the load on your site. Configs like:
connector.crawler.workers.pool_size.limit
connector.crawler.crawl.threads.limit
can help with this.
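As a sketch, assuming a self-managed deployment configured via enterprise-search.yml, that could look like the following. The values here are illustrative, not defaults; lower numbers mean fewer concurrent requests against your domain:

```yaml
# enterprise-search.yml (illustrative values, not defaults)

# Cap the number of parallel crawl worker processes.
connector.crawler.workers.pool_size.limit: 1

# Cap the number of threads a single crawl can use.
connector.crawler.crawl.threads.limit: 2
```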
Finally, making your site respond with 429 errors and a Retry-After header can keep the crawler from over-saturating your request pool.
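If you control the origin (or a proxy in front of it), the server only needs to return 429 with a Retry-After header once requests arrive too quickly. A minimal, hypothetical sketch using Python's standard library, not Elastic code:

```python
# Sketch: a single-threaded server that rejects requests arriving
# faster than MIN_INTERVAL seconds apart with 429 + Retry-After.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

MIN_INTERVAL = 1.0   # minimum seconds between allowed requests
_last_request = 0.0  # safe as a global: HTTPServer is single-threaded


class ThrottlingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global _last_request
        now = time.monotonic()
        if now - _last_request < MIN_INTERVAL:
            # Tell the crawler to back off and when it may retry.
            self.send_response(429)
            self.send_header("Retry-After", "1")
            self.end_headers()
            return
        _last_request = now
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>ok</body></html>")


if __name__ == "__main__":
    HTTPServer(("", 8000), ThrottlingHandler).serve_forever()
```

In practice you'd do this at the web server or reverse proxy layer rather than in application code, but the response shape is the same: a 429 status plus a Retry-After header with the number of seconds to wait.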