Swiftype Crawl Rate

I'm trialing the service to see if it will help a project we're working on, but I can't seem to get the crawler working across the site.
I fed it the seed URL and a sitemap, but it only crawls 20 pages in 12 hours, versus the many thousands of pages that exist.
Is there a better way to get the crawler working? Or should I crawl with another tool and then upload a URL list?

Hello, Chris!

I took a look at the various sitemaps for https://legislat.io. It looks as though there are ~10 pages listed within the sitemaps.

We have a documentation page that may help.

The key item within that page is the concept of discovery. The crawler follows links within pages, unless directed otherwise, and in doing so "crawls/discovers" your pages.

The sitemaps that your robots.txt file references look like so:

  1. https://legislat.io/sitemap-1.xml, 3 URLs
  2. https://legislat.io/image-sitemap-1.xml, 4 images
  3. https://legislat.io/news-sitemap.xml, 3 URLs

If, like the crawler, we follow the links within the items in your sitemap, we aren't left with many pages.
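
For reference, robots.txt points crawlers at sitemaps with standard `Sitemap:` directives, roughly like this (the first two lines are just a common default; the sitemap URLs are the ones your file already lists):

```
User-agent: *
Allow: /

Sitemap: https://legislat.io/sitemap-1.xml
Sitemap: https://legislat.io/image-sitemap-1.xml
Sitemap: https://legislat.io/news-sitemap.xml
```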

I would look at two potential action items:

  1. Update the sitemaps so that they are comprehensive (a minimal example of the format follows this list).
  2. Verify that the website's structure is "hierarchical" in nature, and that links are available as part of a "discovery tree", akin to a "path through your content".
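
For what it's worth, a "comprehensive" sitemap just means one `<url>` entry per page you want crawled, in the standard sitemaps.org XML format. A minimal example might look like this (the page paths below are placeholders, not your real URLs):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://legislat.io/</loc>
  </url>
  <url>
    <loc>https://legislat.io/acts/example-act</loc>
  </url>
  <!-- ...one <url> entry per page you want crawled... -->
</urlset>
```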

Hopefully this is helpful.

Enjoy the rest of the weekend!

Kellen

Thanks Kellen,
The site we're indexing is http://www.legislation.gov.uk/ukpga
The .io is what we're building out; it's very much a work in progress.

For some reason it just gets the links on that page and stops. Is there a way to feed in a list of links?
Chris

Hey Chris --

Ah, I see - lots more links on that page. :grin:

There are a couple ways you could proceed...

  1. Add URLs via the dashboard.

  2. Add URLs via the API.

  3. Reformat the sitemap. The URL you linked has a page called "sitemap", but it isn't technically a sitemap that adheres to the sitemap XML format. Discovery, as linked in the reply above, is still relevant here. The crawler is likely having trouble discovering the other pages; the sketch after this list shows one way to build a proper sitemap from a URL list.
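
On option 3: if you already have (or can export) a flat list of the URLs you want indexed, it's straightforward to turn that into a valid sitemap yourself. A minimal sketch, assuming a plain-text `urls.txt` with one URL per line (the filenames here are placeholders):

```python
# Minimal sketch: turn a plain list of URLs into a sitemaps.org-format
# sitemap.xml that a crawler can use for discovery.
# "urls.txt" and "sitemap.xml" are assumed filenames -- adjust as needed.
from xml.sax.saxutils import escape

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# One <url>/<loc> entry per page, XML-escaped.
entries = "\n".join(
    f"  <url>\n    <loc>{escape(u)}</loc>\n  </url>" for u in urls
)

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n"
    "</urlset>\n"
)

with open("sitemap.xml", "w") as f:
    f.write(sitemap)
```

Note that the sitemap protocol caps a single file at 50,000 URLs (and 50 MB uncompressed), so a very large site may need multiple sitemaps plus a sitemap index file.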

Keep me posted!

Kellen
