We are using the Elastic Web Crawler. Is there a way to limit or throttle the requests the crawler makes against a web data source (a domain)? For example, max 1 request per second, or wait x milliseconds between requests...
Unfortunately, the crawler doesn't currently support the robots.txt crawl-delay directive, which would be the common way to do what you're looking for. If you have a support relationship with Elastic, you can request that feature as an Enhancement Request.
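For reference, the (currently unsupported) directive looks like this in a robots.txt file; crawlers that honor it conventionally interpret the value as seconds between requests:

```
User-agent: *
Crawl-delay: 1
```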
Other than that, you can reduce the parallelism with which the crawler operates to lower the load on your site. Configs like:
connector.crawler.workers.pool_size.limit
connector.crawler.crawl.threads.limit
can help with this.
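As a sketch, assuming a self-managed deployment configured via enterprise-search.yml, that could look like the following. The values here are illustrative, not defaults; lower numbers mean fewer concurrent requests against your domain:

```yaml
# enterprise-search.yml (illustrative values, not defaults)

# Cap the number of parallel crawl worker processes.
connector.crawler.workers.pool_size.limit: 1

# Cap the number of threads a single crawl can use.
connector.crawler.crawl.threads.limit: 2
```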
Finally, making your site respond with 429 errors and a Retry-After header can keep the crawler from over-saturating your request pool.
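If you control the origin (or a proxy in front of it), the server only needs to return 429 with a Retry-After header once requests arrive too quickly. A minimal, hypothetical sketch using Python's standard library, not Elastic code:

```python
# Sketch: a single-threaded server that rejects requests arriving
# faster than MIN_INTERVAL seconds apart with 429 + Retry-After.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

MIN_INTERVAL = 1.0   # minimum seconds between allowed requests
_last_request = 0.0  # safe as a global: HTTPServer is single-threaded


class ThrottlingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global _last_request
        now = time.monotonic()
        if now - _last_request < MIN_INTERVAL:
            # Tell the crawler to back off and when it may retry.
            self.send_response(429)
            self.send_header("Retry-After", "1")
            self.end_headers()
            return
        _last_request = now
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>ok</body></html>")


if __name__ == "__main__":
    HTTPServer(("", 8000), ThrottlingHandler).serve_forever()
```

In practice you'd do this at the web server or reverse proxy layer rather than in application code, but the response shape is the same: a 429 status plus a Retry-After header with the number of seconds to wait.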