How to index only given urls in the Elasticsearch using Open Crawler

Hi, I was wondering how I can index only the URLs that I have listed in the crawlerconfig.yml file. Let me explain with an example; here is my config file for the crawler:

domain_allowlist:
  - https://a.com

seed_urls:
  - https://a.com/foo
  - https://a.com/foo/bar
  - https://a.com/baz/12/test

... other configs

I want to index just those 3 URLs in Elasticsearch and not crawl any further.
I tried the option max_crawl_depth: 1 in the config file, but it didn't work properly.
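For reference, this is roughly what I tried (a sketch of my config, using the same example domain as above):

```yaml
domain_allowlist:
  - https://a.com

seed_urls:
  - https://a.com/foo
  - https://a.com/foo/bar
  - https://a.com/baz/12/test

# Intended to stop the crawler from following links found on the seed pages
max_crawl_depth: 1
```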

Hi @jahedi

I did try the option max_crawl_depth: 1 on the config file but it didn't work properly.

In theory, max_crawl_depth: 1 should be enough to do this. Can you describe what happened? How many docs did it end up ingesting?

One possibility is that Crawler can also use sitemap.xml for content discovery. Do you have this file on your website? If so, you can set sitemap_discovery_disabled: true in the config to ignore it.
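For example, building on your config above, the two relevant settings together would look like this (a sketch; the URLs are from your original example):

```yaml
seed_urls:
  - https://a.com/foo
  - https://a.com/foo/bar
  - https://a.com/baz/12/test

# Only crawl the seed pages themselves, not links found on them
max_crawl_depth: 1

# Don't discover extra URLs via sitemap.xml
sitemap_discovery_disabled: true
```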

If that isn't the case, this is likely a bug.

Hi @nfeekery
I had 3 seed URLs, and in the end only one of them was indexed in Elasticsearch.
I was actually reindexing my web pages: I wanted to reindex 3 pages to update their content in Elasticsearch, but only one of them was updated. I checked the pages based on the last_crawled_at field, and only one page had been recently crawled.

@jahedi I tried to reproduce this by crawling a full site, then specifying only 3 seed URLs and crawling again. I found that the 3 seed URL docs were updated in my case.

Can you provide a few things to help figure out what's going on?

  1. All other configs you've set in the yaml file(s)
  2. The output log from Crawler
  3. What version (or commit sha) you're using for Crawler
  4. If the site is public facing, could you share a link to it?