Swiftype Crawl Rate

I'm trialing the service to see if it will help a project we're working on, but I can't seem to get the crawler working across the site.
I fed it the seed URL and a sitemap, but it only crawls 20 pages in 12 hours, versus the many thousands of pages that exist.
Is there a better way to get the crawler working? Or should I crawl with another tool and then upload a URL list?

Hello, Chris!

I took a look at the various sitemaps for https://legislat.io. It looks as though there are ~10 pages listed within the sitemaps.

We have a documentation page that may help.

The key item within that page is the concept of discovery. The crawler follows links within pages, unless directed otherwise, and in doing so "crawls/discovers" your pages.

The sitemaps that your robots.txt file references look like so:

  1. https://legislat.io/sitemap-1.xml, 3 URLs
  2. https://legislat.io/image-sitemap-1.xml, 4 images
  3. https://legislat.io/news-sitemap.xml, 3 URLs

If, like the crawler, we follow the links within the items in your sitemap, we aren't left with many pages.
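
For reference, robots.txt points crawlers at sitemaps with standard `Sitemap:` directives, roughly like this (the first two lines are just a common default; the sitemap URLs are the ones your file already lists):

```
User-agent: *
Allow: /

Sitemap: https://legislat.io/sitemap-1.xml
Sitemap: https://legislat.io/image-sitemap-1.xml
Sitemap: https://legislat.io/news-sitemap.xml
```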

I would look at two potential action items:

  1. Update the sitemaps so that they are comprehensive (a minimal example of the format follows this list).
  2. Verify that the website's structure is "hierarchical" in nature, and that links are available as part of a "discovery tree", akin to a "path through your content".
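
For what it's worth, a "comprehensive" sitemap just means one `<url>` entry per page you want crawled, in the standard sitemaps.org XML format. A minimal example might look like this (the page paths below are placeholders, not your real URLs):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://legislat.io/</loc>
  </url>
  <url>
    <loc>https://legislat.io/acts/example-act</loc>
  </url>
  <!-- ...one <url> entry per page you want crawled... -->
</urlset>
```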

Hopefully this is helpful.

Enjoy the rest of the weekend!

Kellen

Thanks Kellen,
The site we're indexing is http://www.legislation.gov.uk/ukpga
The .io is what we're building out; it's very much a work in progress.

For some reason it just gets the links on that page and stops. Is there a way to feed in a list of links?
Chris

Hey Chris --

Ah, I see - lots more links on that page. :grin:

There are a couple ways you could proceed...

  1. Add URLs via the dashboard.

  2. Add URLs via the API.

  3. Reformat the sitemap. The URL you linked has a page called "sitemap", but it isn't technically a sitemap that adheres to the sitemap XML format. Discovery, as linked in the reply above, is still relevant here. The crawler is likely having trouble discovering the other pages; the sketch after this list shows one way to build a proper sitemap from a URL list.
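
On option 3: if you already have (or can export) a flat list of the URLs you want indexed, it's straightforward to turn that into a valid sitemap yourself. A minimal sketch, assuming a plain-text `urls.txt` with one URL per line (the filenames here are placeholders):

```python
# Minimal sketch: turn a plain list of URLs into a sitemaps.org-format
# sitemap.xml that a crawler can use for discovery.
# "urls.txt" and "sitemap.xml" are assumed filenames -- adjust as needed.
from xml.sax.saxutils import escape

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# One <url>/<loc> entry per page, XML-escaped.
entries = "\n".join(
    f"  <url>\n    <loc>{escape(u)}</loc>\n  </url>" for u in urls
)

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n"
    "</urlset>\n"
)

with open("sitemap.xml", "w") as f:
    f.write(sitemap)
```

Note that the sitemap protocol caps a single file at 50,000 URLs (and 50 MB uncompressed), so a very large site may need multiple sitemaps plus a sitemap index file.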

Keep me posted!

Kellen
