How to index only given URLs in Elasticsearch using Open Crawler

Hi, I was wondering how I can index only the URLs that I have listed in the crawlerconfig.yml file. Let me explain with an example; I have this config file for the crawler:

domain_allowlist:
  - https://a.com

seed_urls:
  - https://a.com/foo
  - https://a.com/foo/bar
  - https://a.com/baz/12/test

... other configs

I want to index just the 3 given URLs in Elasticsearch and not crawl any further!
I tried the option max_crawl_depth: 1 in the config file, but it didn't work properly.
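
For reference, here is a minimal sketch of how I placed it (same placeholder URLs as above; I'm assuming top-level placement is correct):

domain_allowlist:
  - https://a.com

seed_urls:
  - https://a.com/foo
  - https://a.com/foo/bar
  - https://a.com/baz/12/test

# added to stop the crawl from going past the seed URLs
max_crawl_depth: 1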

Hi @jahedi

I tried the option max_crawl_depth: 1 in the config file, but it didn't work properly.

In theory, max_crawl_depth: 1 should be enough to do this. Can you describe what happened? How many docs did it end up ingesting?

One possibility is that Crawler can also use sitemap.xml for content discovery. Do you have this file on your website? If so, you can set sitemap_discovery_disabled: true in the config to ignore it.
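
For example, a minimal sketch of a config combining both options (same placeholder domain as above):

domain_allowlist:
  - https://a.com

seed_urls:
  - https://a.com/foo

# keep the crawl at the seed URLs only
max_crawl_depth: 1
# don't let sitemap.xml pull in extra pages
sitemap_discovery_disabled: true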

If that isn't the case, this is likely a bug.

Hi @nfeekery
I had 3 seed URLs, and in the end only one of them was indexed in Elasticsearch.
I was actually reindexing my web pages: I wanted to recrawl 3 pages to update their content in Elasticsearch, but only one of them was updated. I checked the pages by their last_crawled_at field, and only one page had been crawled recently.
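
For context, this is roughly the query I ran in Dev Tools to check (my-crawl-index is a placeholder for my real index name, and I'm assuming the url field from the default Crawler mapping):

# my-crawl-index is a placeholder; url is assumed from the default mapping
GET my-crawl-index/_search
{
  "_source": ["url", "last_crawled_at"],
  "sort": [{ "last_crawled_at": "desc" }],
  "size": 10
}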

@jahedi I tried to reproduce this by crawling a full site, then specifying only 3 seed URLs and crawling again. I found that the 3 seed URL docs were updated in my case.

Can you provide a few things to help figure out what's going on?

  1. All other configs you've set in the yaml file(s)
  2. The output log from Crawler
  3. What version (or commit sha) you're using for Crawler
  4. If the site is public facing, could you share a link to it?
