I have added a sitemap index as the sitemap for my crawler in the Elastic Cloud UI. How do I instruct the crawler to crawl only the URLs listed in the sitemap files referenced by the sitemap index?
Setting connector.crawler.crawl.max_crawl_depth.limit to 1 doesn't work; as noted in the post linked above, sitemaps are treated differently from entry points. Is there some other way to crawl only what's referenced in the sitemaps?
Adding thousands of links as entry points would be inefficient.
Hey @pngworkforce! Yes, sitemaps and entry points are different, so setting a sitemap as an entry point won't work: the crawler treats XML files as binary content, so it won't follow the links referenced inside them. However, starting a crawl with a custom sitemap (with automatic sitemap discovery disabled) should limit crawling to the pages referenced in the sitemap.
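In case it helps, here's a minimal sketch of starting such a crawl programmatically. It assumes the Enterprise Search crawler API's crawl_requests endpoint and its overrides (sitemap_urls, sitemap_discovery_disabled, max_crawl_depth) as I recall them from the 8.x API reference, so please double-check against the docs for your version. The deployment URL, API key, index name, and sitemap URL are all placeholders:

```python
import requests

# Placeholder deployment details -- none of these values come from this thread.
ENTERPRISE_SEARCH_URL = "https://my-deployment.ent.example.com"
INDEX_NAME = "search-my-crawler-index"
API_KEY = "my-api-key"  # auth scheme varies by deployment (assumption)

# Request a one-off crawl seeded only from the listed sitemaps, with
# automatic sitemap discovery (robots.txt, /sitemap.xml) switched off,
# so the crawler should fetch only URLs listed in those sitemaps.
resp = requests.post(
    f"{ENTERPRISE_SEARCH_URL}/api/ent/v1/internal/indices/{INDEX_NAME}/crawler2/crawl_requests",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "overrides": {
            "sitemap_urls": ["https://example.com/sitemap.xml"],
            "sitemap_discovery_disabled": True,
            "max_crawl_depth": 1,  # don't follow links found on crawled pages
        }
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

If your version exposes these options, the same result should be achievable through the "Crawl with custom settings" UI flow by supplying custom sitemaps and unticking sitemap discovery.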
What do you mean by “doesn’t work”? Are you getting too many or too few documents crawled?
Thanks for the reply. We are getting too many documents in the crawl. Occasionally we even get domains that aren't configured added to the index (usually only one or two pages from other domains, but it happens fairly often).
We have an index with a single entry point and two sitemaps: one for pages, the other for documents and other assets. All use the same domain.
We can exclude these extra results by adding the offending domain to the crawler and then disallowing everything, but from what I understand the crawler shouldn't ingest any documents that don't match a configured domain in the first place.
Besides the results from the unexpected domain, documents from the configured domain that aren't in the sitemap are also being included. Is there any way to crawl and strictly stick to the sitemap?
If your site is public, can you share the domain so we can take a closer look?
Also, can you provide the following information?

- What version of Elasticsearch/Enterprise Search are you using?
- Are you using the App Search Crawler or the Elastic Crawler?
- Do you schedule full crawls, or run them manually through the UI?
- Do you have any scheduled partial crawls? (These are set through the "Crawl with custom settings" UI button.)
- What other Crawler server configurations have you set aside from max crawl depth?
- Are the sitemaps located on the same domain as the "main" domain for this crawler?
- You mentioned some domains are being mistakenly ingested. Are these subdomains, or completely different domains? (E.g. if the main domain is xyz.com, is the other domain a subdomain like abc.xyz.com, or something completely different like 123.com?)
Yes, the sitemaps are in the root web folder of the domain.
Yes, the domain being included is another subdomain.
For example, our entry point domain is test-www.website.com.
Our sitemaps are test-www.website.com/sitemap.xml and test-www.website.com/sitemap-assets.xml. Both are sitemap index files containing links to other sitemaps for each site section.
The domain being found in our results alongside the expected results is test-assets.website.com. There aren't many of these in the index (about 50 out of 10,000 docs), but they show up regularly in query hits. They also have the same URL paths as results from the correct test-www.website.com domain.
We have checked the sitemaps and the canonical URLs; this domain does not appear in either.
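For anyone wanting to check this in their own index, here's a sketch of counting documents per host with a terms aggregation over the crawler's url_host field (the deployment URL, index name, and API key are placeholders, and it assumes direct Elasticsearch access):

```python
import requests

# Placeholders -- point these at your own Elasticsearch deployment (assumptions).
ES_URL = "https://my-deployment.es.example.com"
INDEX_NAME = "search-my-crawler-index"
API_KEY = "my-api-key"

# Count documents per host to see exactly which (sub)domains ended up in
# the crawler index; crawler documents carry a url_host keyword field.
resp = requests.post(
    f"{ES_URL}/{INDEX_NAME}/_search",
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={
        "size": 0,
        "aggs": {"hosts": {"terms": {"field": "url_host", "size": 20}}},
    },
    timeout=30,
)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["hosts"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]} docs')
```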
@pngworkforce Thanks for providing all of that information. It's certainly unexpected behaviour; the settings you've configured should work as expected.
I tried to reproduce this with a similar setup but couldn't, so there must be something we're missing here. I have some more follow-up questions:
- The incorrect test-assets.website.com URLs must be being discovered somewhere. Do pages under test-www.website.com contain links to test-assets.website.com?
- Has this index ever previously been used to crawl URLs under the domain test-assets.website.com? (The query sketch after this list may help check this.)
- What do you have registered in your "Domains" section?
- What crawl rules have you configured?
- If you create a new web crawler for a new index and crawl the same website, do you still ingest the same incorrect data?
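For the second question, one way to check is to pull the stray documents together with their crawl timestamps; if last_crawled_at predates your current configuration, they may be leftovers from an earlier crawl rather than newly discovered URLs. A sketch, assuming direct Elasticsearch access and the standard url_host and last_crawled_at crawler document fields (the URL, index name, and API key are placeholders):

```python
import requests

# Placeholders -- adjust for your deployment (assumptions, not from this thread).
ES_URL = "https://my-deployment.es.example.com"
INDEX_NAME = "search-my-crawler-index"
API_KEY = "my-api-key"

# Fetch the documents from the unexpected subdomain, newest crawl first,
# to see when they were ingested and which paths they cover.
resp = requests.post(
    f"{ES_URL}/{INDEX_NAME}/_search",
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={
        "query": {"term": {"url_host": "test-assets.website.com"}},
        "_source": ["url", "last_crawled_at"],
        "sort": [{"last_crawled_at": "desc"}],
        "size": 50,
    },
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["last_crawled_at"], hit["_source"]["url"])
```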