I have added a sitemap index as the sitemap for my crawler in the Elastic Cloud UI. How do I instruct the crawler to crawl only the URLs listed in the sitemap files referenced by the sitemap index?
Setting connector.crawler.crawl.max_crawl_depth.limit to 1 doesn't work; as noted in the post linked above, sitemaps are treated differently from entry points. Is there some other way to crawl only what's referenced in the sitemaps?
Adding thousands of links as entry points would be inefficient.
Hey @pngworkforce! Yes, sitemaps and entry points are different, so setting a sitemap as an entry point won't work: the crawler treats XML files as binary content, so it won't follow the links referenced inside them. However, starting a crawl with a custom sitemap (with automatic sitemap discovery disabled) should limit crawling to the pages referenced in the sitemap.
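In case it helps, here's a minimal sketch of starting such a crawl programmatically. It assumes the Enterprise Search crawler API's crawl_requests endpoint and its overrides (sitemap_urls, sitemap_discovery_disabled, max_crawl_depth) as I recall them from the 8.x API reference, so please double-check against the docs for your version. The deployment URL, API key, index name, and sitemap URL are all placeholders:

```python
import requests

# Placeholder deployment details -- none of these values come from this thread.
ENTERPRISE_SEARCH_URL = "https://my-deployment.ent.example.com"
INDEX_NAME = "search-my-crawler-index"
API_KEY = "my-api-key"  # auth scheme varies by deployment (assumption)

# Request a one-off crawl seeded only from the listed sitemaps, with
# automatic sitemap discovery (robots.txt, /sitemap.xml) switched off,
# so the crawler should fetch only URLs listed in those sitemaps.
resp = requests.post(
    f"{ENTERPRISE_SEARCH_URL}/api/ent/v1/internal/indices/{INDEX_NAME}/crawler2/crawl_requests",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "overrides": {
            "sitemap_urls": ["https://example.com/sitemap.xml"],
            "sitemap_discovery_disabled": True,
            "max_crawl_depth": 1,  # don't follow links found on crawled pages
        }
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

If your version exposes these options, the same result should be achievable through the "Crawl with custom settings" UI flow by supplying custom sitemaps and unticking sitemap discovery.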
What do you mean by “doesn’t work”? Are you getting too many or too few documents crawled?
Thanks for the reply. We are getting too many documents in the crawl. Occasionally we even get domains that aren't configured added to the index (usually only one or two pages from other domains, but it happens fairly often).
We have an index with a single entry point and two sitemaps: one for pages, the other for documents and other assets. All use the same domain.
We can exclude these extra results by adding the offending domain to the crawler and then disallowing everything, but from what I understand the crawler shouldn't ingest any documents that don't match a configured domain in the first place.
Besides the results from the unexpected domain, documents from the configured domain that aren't in the sitemap are also being included. Is there any way to crawl and strictly stick to the sitemap?
If your site is public, can you share the domain so we can take a closer look?
Also, can you provide the following information?

- What version of Elasticsearch/Enterprise Search are you using?
- Are you using the App Search Crawler or the Elastic Crawler?
- Do you schedule full crawls, or run them manually through the UI?
- Do you have any scheduled partial crawls? (These are set through the "Crawl with custom settings" UI button.)
- What other Crawler server configurations have you set aside from max crawl depth?
- Are the sitemaps located on the same domain as the "main" domain for this crawler?
- You mentioned some domains are being mistakenly ingested. Are these subdomains, or completely different domains? (E.g. if the main domain is xyz.com, is the other domain a subdomain like abc.xyz.com, or something completely different like 123.com?)
Yes, the sitemaps are in the root web folder of the domain.
Yes, the domain being included is another subdomain.
For example, our entry point domain is test-www.website.com.
Our sitemaps are test-www.website.com/sitemap.xml and test-www.website.com/sitemap-assets.xml. Both are sitemap index files containing links to other sitemaps for each site section.
The domain being found in our results alongside the expected results is test-assets.website.com. There aren't many of these in the index (about 50 out of 10,000 docs), but they show up regularly in query hits. They also have the same URL paths as results from the correct test-www.website.com domain.
We have checked the sitemaps and the canonical URLs; this domain does not appear in either.
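For anyone wanting to check this in their own index, here's a sketch of counting documents per host with a terms aggregation over the crawler's url_host field (the deployment URL, index name, and API key are placeholders, and it assumes direct Elasticsearch access):

```python
import requests

# Placeholders -- point these at your own Elasticsearch deployment (assumptions).
ES_URL = "https://my-deployment.es.example.com"
INDEX_NAME = "search-my-crawler-index"
API_KEY = "my-api-key"

# Count documents per host to see exactly which (sub)domains ended up in
# the crawler index; crawler documents carry a url_host keyword field.
resp = requests.post(
    f"{ES_URL}/{INDEX_NAME}/_search",
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={
        "size": 0,
        "aggs": {"hosts": {"terms": {"field": "url_host", "size": 20}}},
    },
    timeout=30,
)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["hosts"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]} docs')
```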
@pngworkforce Thanks for providing all of that information. It's certainly unexpected behaviour; the settings you've configured should work as expected.
I tried to reproduce this with a similar setup but couldn't, so there must be something we're missing here. I have some more follow-up questions:
- The incorrect test-assets.website.com URLs must be being discovered somewhere. Do pages under test-www.website.com contain links to test-assets.website.com?
- Has this index ever previously been used to crawl URLs under the domain test-assets.website.com? (The query sketch after this list may help check this.)
- What do you have registered in your "Domains" section?
- What crawl rules have you configured?
- If you create a new web crawler for a new index and crawl the same website, do you still ingest the same incorrect data?
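For the second question, one way to check is to pull the stray documents together with their crawl timestamps; if last_crawled_at predates your current configuration, they may be leftovers from an earlier crawl rather than newly discovered URLs. A sketch, assuming direct Elasticsearch access and the standard url_host and last_crawled_at crawler document fields (the URL, index name, and API key are placeholders):

```python
import requests

# Placeholders -- adjust for your deployment (assumptions, not from this thread).
ES_URL = "https://my-deployment.es.example.com"
INDEX_NAME = "search-my-crawler-index"
API_KEY = "my-api-key"

# Fetch the documents from the unexpected subdomain, newest crawl first,
# to see when they were ingested and which paths they cover.
resp = requests.post(
    f"{ES_URL}/{INDEX_NAME}/_search",
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={
        "query": {"term": {"url_host": "test-assets.website.com"}},
        "_source": ["url", "last_crawled_at"],
        "sort": [{"last_crawled_at": "desc"}],
        "size": 50,
    },
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["last_crawled_at"], hit["_source"]["url"])
```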