Web crawler does not crawl relative URLs in the same domain (incorrect protocol deny reason)

markn · August 2, 2021, 6:32am

Hi,

The domain we want to crawl is https://www.unive.nl
We run Enterprise Search on Elastic Cloud (Azure, Netherlands).

Full (absolute) URLs are crawled without problems.
Example: Fraudebeleid - Univé
Log message:
Time: Jul 30, 2021 @ 12:13:33.000
crawler.url.deny_reason: -
message: Indexed the document into App Search with doc_id=6103d05f9c14a1e5fc7d36f1
url.full: Fraudebeleid - Univé
event.type: info

Many pages contain relative URLs, and these are not crawled (in one 45-minute crawl we got 40,460 incorrect_protocol hits). The web crawler seems to try to crawl them as-is instead of resolving them against the domain.
Example: Verhuizing en uw verzekeringen
Log message:
Time: Jul 30, 2021 @ 12:11:47.000
crawler.url.deny_reason: incorrect_protocol
message: -
url.full: /klantenservice/verhuizing
event.type: denied
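
To illustrate what I would expect to happen, here is a small sketch using Python's standard library. It is not the crawler's actual code: the function names are made up, and the page URL the link was found on is an assumption (I used the site root). It only shows that the relative path has no protocol of its own until it is resolved against the page it came from.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical helper: resolve a link found on a page to an absolute URL.
def resolve_link(page_url: str, href: str) -> str:
    return urljoin(page_url, href)

# Hypothetical helper: a protocol check similar in spirit to the deny reason.
def has_allowed_protocol(url: str, allowed=("http", "https")) -> bool:
    return urlsplit(url).scheme in allowed

page_url = "https://www.unive.nl/"    # assumed page the link was found on
href = "/klantenservice/verhuizing"   # relative URL from the log entry

# Checked as-is, the relative URL has an empty scheme and fails the check.
print(has_allowed_protocol(href))      # False

# Resolved against the page URL first, it passes.
resolved = resolve_link(page_url, href)
print(resolved)                        # https://www.unive.nl/klantenservice/verhuizing
print(has_allowed_protocol(resolved))  # True
```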

By the way, the web crawler did find the same page through its full URL, but I think it should also follow relative URLs, right?
Time: Jul 30, 2021 @ 11:56:50.000
crawler.url.deny_reason: -
message: Indexed the document into App Search with doc_id=6103ccc69c14a198d17c36c4
url.full: Verhuizing en uw verzekeringen
event.type: info

Is this expected behavior? I think the web crawler should follow these relative URLs as well. We also added the sitemap.xml URL to the search engine, but that did not help.

Thanks.

Kind regards,
Mark

Carlos_D · August 9, 2021, 11:29am

Hi @markn:

We have confirmed this is a bug in the crawler implementation.

The bug has been fixed in our codebase, and the fix will be released in versions 7.14.1 and 7.15 of Enterprise Search. Stay tuned!

Thanks for reaching out, and for the detailed examples that allowed us to fix this issue!

markn · August 11, 2021, 9:49am

Thanks, Carlos, for confirming this and fixing it in the upcoming releases!

system · September 8, 2021, 9:50am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.