Web crawler does not crawl relative URLs in the same domain (incorrect_protocol deny reason)

Hi,

The domain we want to crawl is https://www.unive.nl
We run Enterprise Search on Elastic Cloud (Azure, Netherlands).

Full URLs are crawled without problems.
Example: Fraudebeleid - Univé
Log message:
| Time | crawler.url.deny_reason | message | url.full | event.type |
| --- | --- | --- | --- | --- |
| Jul 30, 2021 @ 12:13:33.000 | - | Indexed the document into App Search with doc_id=6103d05f9c14a1e5fc7d36f1 | Fraudebeleid - Univé | info |

Many pages contain relative URLs, and these are not crawled (a single 45-minute crawl produced 40,460 incorrect_protocol hits). The web crawler seems to try to crawl them as-is instead of resolving each relative URL against the domain.
Example: Verhuizing en uw verzekeringen
Log message:
| Time | crawler.url.deny_reason | message | url.full | event.type |
| --- | --- | --- | --- | --- |
| Jul 30, 2021 @ 12:11:47.000 | incorrect_protocol | - | /klantenservice/verhuizing | denied |
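
To illustrate the resolution I would expect, here is a minimal Python sketch. It is not the crawler's actual code: `resolve_link` is a hypothetical helper, and using the site root as the page the link was found on is an assumption. The idea that a missing scheme is what triggers the incorrect_protocol denial is also only my guess.

```python
from urllib.parse import urljoin, urlsplit


def resolve_link(page_url: str, href: str) -> str:
    """Resolve a link found on page_url into an absolute URL (hypothetical helper)."""
    return urljoin(page_url, href)


# Assumed page on which the link was discovered.
page_url = "https://www.unive.nl/"

# Relative link exactly as it appears in the denied log entry.
relative_link = "/klantenservice/verhuizing"

absolute_url = resolve_link(page_url, relative_link)

# The raw relative link carries no scheme; the resolved URL inherits "https".
print(repr(urlsplit(relative_link).scheme))
print(urlsplit(absolute_url).scheme)
print(absolute_url)
```

With this kind of resolution the denied entry above would become https://www.unive.nl/klantenservice/verhuizing, which is on the same domain and uses an allowed protocol.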

By the way, the web crawler did find this page through its full URL, but I think it should also follow the relative URLs:

| Time | crawler.url.deny_reason | message | url.full | event.type |
| --- | --- | --- | --- | --- |
| Jul 30, 2021 @ 11:56:50.000 | - | Indexed the document into App Search with doc_id=6103ccc69c14a198d17c36c4 | Verhuizing en uw verzekeringen | info |

Is this expected behavior? I would expect the web crawler to follow these relative URLs as well. We also added the sitemap.xml URL to the engine, but that did not help.

Thanks.

Kind regards,
Mark

Hi @markn:

We have confirmed this is a bug in the crawler implementation.

The bug has been fixed in our codebase, and the fix will be released in versions 7.14.1 and 7.15 of Enterprise Search. Stay tuned!

Thanks for reaching out, and for the detailed examples that allowed us to fix this issue!

Thanks, Carlos, for confirming this and for fixing it in the upcoming releases!