Crawl options for Elastic Open Web Crawler

Hello,

I am doing some initial work with the Open Web Crawler and am trying to find out whether a few settings I rely on exist in it yet.

The current Enterprise Search Crawler allows for the following:

connector.crawler.crawl.threads.limit: 1
connector.crawler.crawl.max_crawl_depth.limit: 5
connector.crawler.http.response_size.limit: 80485760
crawler.http.response_size.limit: 80485760

crawler.http.request_timeout: 180
crawler.http.connection_timeout: 180
crawler.http.read_timeout: 180

connector.crawler.http.request_timeout: 180
connector.crawler.http.connection_timeout: 180
connector.crawler.http.read_timeout: 180

I see there is already a config for the crawl depth:

## The maximum depth that Crawler will follow links to.
#max_crawl_depth: 2
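
If that's the one, matching my current depth limit should just be a matter of uncommenting it and bumping the value:

## Mirrors connector.crawler.crawl.max_crawl_depth.limit: 5 above
max_crawl_depth: 5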

The other configs I use help prevent documents larger than the configured response size limit (80485760 bytes) from being indexed, and control how long to wait before giving up on a document probe.

Is this a related request for that type of functionality: Make ES request settings configurable · Issue #185 · elastic/crawler?

And connector.crawler.crawl.threads.limit: 1 is of particular interest. It lets the current crawler probe the target at roughly one page per second. The default of 10 pages in the crawl queue per second was found to be aggressive from the web host's perspective. Letting it fetch one link/URL/page/document per second has been effective and low impact on the web host(s). The content crawled in this use case is not huge, and the time it takes for a crawl to complete is acceptable.

Is the comparable config for the crawl threads limit the same as the bulk_api: max_items=1 setting?
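
For reference, here is the shape I mean, going off the example config in the crawler repo (the exact nesting is just my reading of it, so treat it as unverified):

elasticsearch:
  bulk_api:
    max_items: 1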

Hi @alongaks

There are configuration options available in Open Crawler to achieve most (or all) of what you're attempting to do. We don't have them documented yet, but that will change soon.
For now, if a config is present among these options in the codebase, then it should work.

Some specific configurations that you've asked for:

connector.crawler.crawl.threads.limit can be configured with threads_per_crawl.
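
For example, to mirror your threads.limit: 1 setting, you would add this to your crawl config (the comment is mine; this should give you the ~1 page per second behaviour you described):

## Limit the crawl to a single fetch thread
threads_per_crawl: 1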

The other configs I use help prevent documents larger than the configured response size limit (80485760 bytes) from being indexed, and control how long to wait before giving up on a document probe.

Among the configs linked above, this set will allow you to configure the HTTP request/response settings when interacting with your website. Note that response_size is configured in bytes.
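
As a rough sketch of how that would look in the YAML, with values matching your Enterprise Search settings above (the field names here are my reading of the linked code, so double-check them there since they aren't documented yet):

## HTTP timeouts, in seconds
connect_timeout: 180
socket_timeout: 180
request_timeout: 180

## Maximum HTTP response size, in bytes
max_response_size: 80485760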

To configure the size of content being ingested into Elasticsearch, you can use this set of configs. These won't have an impact on the HTTP response size limit when crawling a URL, so if the content extracted is larger than this limit it will just be cut off.
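
For example (again, the field names are per my reading of the code, and the values here are only illustrative):

## Per-field limits on extracted content sent to Elasticsearch, in bytes
max_title_size: 1000
max_body_size: 5242880
max_keywords_size: 512
max_description_size: 512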

Sorry that there's no documentation for this yet. It's on our roadmap and should be in a better state soon. If you have any further specific questions about these configs, reply here and I'll help you out :slight_smile:


Hello, Navarone

Appreciate the info! Very helpful.

I will give these options a look and return with more "asks" if needed. :+1:
