I've been struggling to set up this web crawler properly. The sitemap is an XML sitemap, but the format is a table with the link to the page, number of images, and last modified date. The desired experience is for the crawler to start on the page with the list of links, click on each link, and then scrape the given xpath. I set up the crawler with the base domain (examplesite .com), set the entry point as the sitemap (examplesite .com/sitemap.xml), set the depth to 2, and ran the crawler. When I inspect the documents created, many of them are from URLs that are not on the sitemap. I'm not sure where the crawler is getting these URLs from. Is anyone able to help me troubleshoot this?
TL;DR: Is the site you're trying to crawl public? If so, can you share the domain, so that we can talk in specifics?
set the entry point as the sitemap (examplesite .com/sitemap.xml)
Sitemaps and entrypoints are similar, but distinct.
A Sitemap is part of general website conventions, a lot like a robots.txt. Most sites will have a top-level sitemap.xml at <domain>/sitemap.xml, which describes the shape of the whole site.
An example: https://chilis.com/sitemap.xml
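To make the format concrete, here is a minimal Python sketch that pulls the page URLs out of a standard sitemap. The sample XML and the function name `sitemap_urls` are made up for illustration; the namespace is the one defined by the sitemaps.org protocol. A real sitemap would be fetched from <domain>/sitemap.xml rather than hard-coded.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample document, used here instead of a network fetch.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://examplesite.com/</loc>
    <lastmod>2023-01-15</lastmod>
  </url>
  <url>
    <loc>https://examplesite.com/about</loc>
  </url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return the <loc> values listed in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in sitemap_urls(SAMPLE):
    print(url)
```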
An Entrypoint is something more specific to our web crawler implementation. It helps if you want to give the crawl more specific guidance on which pages it should start with, instead of trusting the site maintainers who wrote the sitemap.xml.
For the above example, let's say I live in Utah, and want to build a search experience for Chilis restaurants in Utah only. I don't actually need a lot of the pages the sitemap will link out to. So I might add entrypoints for the Utah-based restaurants, and then disable the sitemap discovery in my crawl settings.
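One way to come up with entrypoints like that is to filter a list of the site's URLs by path prefix. The sketch below is only an illustration: the URLs and the `/locations/us/utah/` prefix are hypothetical and not the real structure of any site, and `pick_entrypoints` is a made-up helper. The result is just a list you would then enter as entrypoints in your crawl settings.

```python
from urllib.parse import urlparse

# Hypothetical URLs; a real site's layout may look quite different.
ALL_URLS = [
    "https://example.com/locations/us/utah/salt-lake-city",
    "https://example.com/locations/us/utah/provo",
    "https://example.com/locations/us/texas/dallas",
    "https://example.com/menu",
]


def pick_entrypoints(urls, path_prefix):
    """Keep only the URLs whose path starts with the given prefix."""
    return [u for u in urls if urlparse(u).path.startswith(path_prefix)]


entrypoints = pick_entrypoints(ALL_URLS, "/locations/us/utah/")
for url in entrypoints:
    print(url)
```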
My guess (again, if you can share specifics it'll help) is that you didn't set an entrypoint for examplesite.com/some/path/to/a/nonstandard/sitemap.xml. If this was a true sitemap format, cool, but if you didn't disable sitemap discovery, your crawl still looked up the default sitemap at examplesite.com/sitemap.xml, which may be why you saw URLs you weren't expecting.
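To see exactly which crawled URLs were not on the list you expected, a simple set comparison can help. This is a rough sketch with made-up data: `crawled` stands in for the URLs of the documents your crawl produced, `expected` for the links in your table, and `unexpected_urls` is a hypothetical helper that ignores trailing slashes when comparing.

```python
def unexpected_urls(crawled, expected):
    """Return crawled URLs that are absent from the expected list.

    Trailing slashes are ignored so that /about and /about/ count as equal.
    """
    expected_set = {u.rstrip("/") for u in expected}
    return sorted(u for u in crawled if u.rstrip("/") not in expected_set)


# Hypothetical data for illustration only.
expected = ["https://examplesite.com/a", "https://examplesite.com/b/"]
crawled = [
    "https://examplesite.com/a/",
    "https://examplesite.com/b",
    "https://examplesite.com/blog/old-post",
]

for url in unexpected_urls(crawled, expected):
    print(url)
```

Anything this prints came from somewhere other than your list, for example the default sitemap or links followed from crawled pages.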
If it wasn't even a valid sitemap format, not only would you get the default sitemap URLs, but you probably didn't even get the URLs you were expecting, because (with sitemaps as an exception) the crawler doesn't look up links in arbitrary XML files it finds; it treats them as binary documents.
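A quick way to test that hypothesis yourself is to check the file's root element. This is a sketch of a manual check, not the crawler's actual logic: per the sitemaps.org protocol, a sitemap's root is `urlset` (or `sitemapindex` for a sitemap index) in the sitemaps.org namespace. The function name `looks_like_sitemap` and both sample documents are made up.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# ElementTree reports namespaced tags as "{namespace}localname".
VALID_ROOTS = {f"{{{SITEMAP_NS}}}urlset", f"{{{SITEMAP_NS}}}sitemapindex"}


def looks_like_sitemap(xml_text):
    """Return True if the document's root element matches the sitemap protocol."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed XML at all
    return root.tag in VALID_ROOTS


# Hypothetical documents: one standard sitemap, one custom table-like XML file.
GOOD = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://examplesite.com/</loc></url>"
    "</urlset>"
)
BAD = (
    "<table><row><link>https://examplesite.com/</link>"
    "<images>3</images></row></table>"
)

print(looks_like_sitemap(GOOD))
print(looks_like_sitemap(BAD))
```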