Hello,
I am doing some initial work with the Open Web Crawler and am looking for a few settings; I'd like to know whether they exist yet.
The current Enterprise Search Crawler allows for the following:
connector.crawler.crawl.threads.limit: 1
connector.crawler.crawl.max_crawl_depth.limit: 5
connector.crawler.http.response_size.limit: 80485760
crawler.http.response_size.limit: 80485760
crawler.http.request_timeout: 180
crawler.http.connection_timeout: 180
crawler.http.read_timeout: 180
connector.crawler.http.request_timeout: 180
connector.crawler.http.connection_timeout: 180
connector.crawler.http.read_timeout: 180
I see there is a config allowed for the crawl depth:
## The maximum depth that Crawler will follow links to.
#max_crawl_depth: 2
The other configs I use help prevent documents larger than the response_size limit from being indexed, and control how long to wait before giving up on a document probe.
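To make the intent of those two limits concrete, here is a minimal sketch in Python. It is purely illustrative and not part of either crawler's code; the function names, constants, and fetch logic are my own assumptions, chosen only to mirror the values listed above (an 80485760-byte response cap and 180-second timeouts).

```python
# Illustrative only: mimics the effect of a response_size limit plus a
# request timeout on a single document probe. Names are hypothetical,
# not actual Crawler APIs.
import urllib.request

MAX_RESPONSE_SIZE = 80485760  # bytes, matching the limit listed above
REQUEST_TIMEOUT = 180         # seconds, matching the 180s timeouts above

def within_size_limit(content_length, max_size=MAX_RESPONSE_SIZE):
    """True if a reported Content-Length is absent or within the cap."""
    return content_length is None or int(content_length) <= max_size

def fetch_if_within_limits(url: str):
    """Fetch a URL, skipping it if it is too large or the probe times out."""
    try:
        with urllib.request.urlopen(url, timeout=REQUEST_TIMEOUT) as resp:
            if not within_size_limit(resp.headers.get("Content-Length")):
                return None  # too large: skip instead of indexing
            body = resp.read(MAX_RESPONSE_SIZE + 1)
            if len(body) > MAX_RESPONSE_SIZE:
                return None  # body exceeded the cap while streaming: skip
            return body
    except OSError:
        return None  # timeout or connection failure: give up on this probe
```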
Is this a related request for that type of functionality: Make ES request settings configurable · Issue #185 · elastic/crawler?
The connector.crawler.crawl.threads.limit: 1
setting is of particular interest. It lets the current crawler probe the target at ~1 page per second. The default of 10 pages in the crawl queue per second was found to be aggressive from the web host's perspective. Letting it do one link/URL/page/document per second has been effective and low impact on the web host(s). The content crawled in this use case is not huge, and the time a crawl takes to complete is acceptable.
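For anyone unfamiliar with why a single thread matters here: with one worker, the crawl rate is naturally capped near one fetch per second per host. A minimal Python sketch of that effect (my own illustration, not crawler code; the Throttle class and its parameters are assumptions):

```python
# Illustrative only: a fixed-interval throttle approximating what
# connector.crawler.crawl.threads.limit: 1 achieves in practice
# (~1 fetch per second against the target web host).
import time

class Throttle:
    """Enforce a minimum interval between successive fetches."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> float:
        """Sleep as needed before the next fetch; return seconds slept."""
        now = time.monotonic()
        delay = max(0.0, self._last + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay
```

In a single-threaded crawl loop, calling throttle.wait() before each fetch keeps the load on the web host at roughly one request per min_interval seconds.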
Is the comparable config for the crawl threads limit the bulk_api max_items: 1 setting?