Hi @jahedi
I did try the option
max_crawl_depth: 1
in the config file, but it didn't work properly.
In theory, max_crawl_depth: 1
should be enough for this. Can you describe what happened? How many docs did it end up ingesting?
One possibility is that the crawler is also using sitemap.xml
for content discovery. Do you have this file on your website? If so, you can set sitemap_discovery_disabled: true
in the config to ignore it.
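For reference, here is a minimal sketch of what the combined config might look like. This is an assumption about the layout, not a verified config: the exact nesting of these options can vary between crawler versions, and example.com is a placeholder for your site.

```yaml
# Sketch only -- check your crawler version's docs for exact option placement.
domains:
  - url: https://example.com  # placeholder for your site

# Restrict crawling to the entry page(s) only
max_crawl_depth: 1

# Skip sitemap.xml-based URL discovery
sitemap_discovery_disabled: true
```

With both options set, the crawler should only ingest the pages listed as entry points, without following links or sitemap entries.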
If that isn't the case, this is likely a bug.