Crawl options for Elastic Open Web Crawler

Hello,

I am doing some initial work with the Open Web Crawler and am trying to find out whether a few settings I rely on exist in it yet.

The current Enterprise Search Crawler allows for the following:

connector.crawler.crawl.threads.limit: 1
connector.crawler.crawl.max_crawl_depth.limit: 5
connector.crawler.http.response_size.limit: 80485760
crawler.http.response_size.limit: 80485760

crawler.http.request_timeout: 180
crawler.http.connection_timeout: 180
crawler.http.read_timeout: 180

connector.crawler.http.request_timeout: 180
connector.crawler.http.connection_timeout: 180
connector.crawler.http.read_timeout: 180

I see there is already a config for the crawl depth:

## The maximum depth that Crawler will follow links to.
#max_crawl_depth: 2
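
If that's the one, matching my current depth limit should just be a matter of uncommenting it and bumping the value:

## Mirrors connector.crawler.crawl.max_crawl_depth.limit: 5 above
max_crawl_depth: 5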

The other configs I use help prevent documents larger than the configured response size limit (80485760 bytes) from being indexed, and control how long to wait before giving up on a document probe.

Is this a related request for that type of functionality: Make ES request settings configurable · Issue #185 · elastic/crawler?

And connector.crawler.crawl.threads.limit: 1 is of particular interest. It lets the current crawler probe the target at roughly one page per second. The default of 10 pages in the crawl queue per second was found to be aggressive from the web host's perspective. Letting it fetch one link/URL/page/document per second has been effective and low impact on the web host(s). The content crawled in this use case is not huge, and the time it takes for a crawl to complete is acceptable.

Is the comparable config for the crawl threads limit the same as the bulk_api: max_items=1 setting?
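
For reference, here is the shape I mean, going off the example config in the crawler repo (the exact nesting is just my reading of it, so treat it as unverified):

elasticsearch:
  bulk_api:
    max_items: 1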

Hi @alongaks

There are configuration options available in Open Crawler to achieve most (or all) of what you're attempting to do. We don't have them documented yet, but that will change soon.
For now, if a config is present among these options in the codebase, then it should work.

Some specific configurations that you've asked for:

connector.crawler.crawl.threads.limit can be configured with threads_per_crawl.
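
For example, to mirror your threads.limit: 1 setting, you would add this to your crawl config (the comment is mine; this should give you the ~1 page per second behaviour you described):

## Limit the crawl to a single fetch thread
threads_per_crawl: 1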

The other configs I use help prevent documents larger than the configured response size limit (80485760 bytes) from being indexed, and control how long to wait before giving up on a document probe.

Among the configs linked above, this set will allow you to configure the HTTP request/response settings when interacting with your website. Note that response_size is configured in bytes.
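
As a rough sketch of how that would look in the YAML, with values matching your Enterprise Search settings above (the field names here are my reading of the linked code, so double-check them there since they aren't documented yet):

## HTTP timeouts, in seconds
connect_timeout: 180
socket_timeout: 180
request_timeout: 180

## Maximum HTTP response size, in bytes
max_response_size: 80485760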

To configure the size of content being ingested into Elasticsearch, you can use this set of configs. These won't have an impact on the HTTP response size limit when crawling a URL, so if the content extracted is larger than this limit it will just be cut off.
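
For example (again, the field names are per my reading of the code, and the values here are only illustrative):

## Per-field limits on extracted content sent to Elasticsearch, in bytes
max_title_size: 1000
max_body_size: 5242880
max_keywords_size: 512
max_description_size: 512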

Sorry that there's no documentation for this yet. It's on our roadmap and should be in a better state soon. If you have any further specific questions about these configs, reply here and I'll help you out :slight_smile:


Hello, Navarone

Appreciate the info! Very helpful.

I will give these options a look and return with more "asks" if needed. :+1:
