Hi,
The domain we want to crawl is https://www.unive.nl
We run Enterprise Search on Elastic Cloud (Azure, Netherlands).
Full (absolute) URLs are crawled without problems.
Example: Fraudebeleid - Univé
Log message:
Time | crawler.url.deny_reason | message | url.full | event.type
Jul 30, 2021 @ 12:13:33.000 | - | Indexed the document into App Search with doc_id=6103d05f9c14a1e5fc7d36f1 | Fraudebeleid - Univé | info
Many pages contain relative URLs, and those are not crawled (a single 45-minute crawl produced 40,460 incorrect_protocol hits). The web crawler seems to try to crawl them as-is instead of resolving each relative URL against the domain.
Example: Verhuizing en uw verzekeringen
Log message:
Time | crawler.url.deny_reason | message | url.full | event.type
Jul 30, 2021 @ 12:11:47.000 | incorrect_protocol | - | /klantenservice/verhuizing | denied
By the way, the web crawler did find this page through its full URL, but I think it should also follow relative URLs, right?
Time | crawler.url.deny_reason | message | url.full | event.type
Jul 30, 2021 @ 11:56:50.000 | - | Indexed the document into App Search with doc_id=6103ccc69c14a198d17c36c4 | Verhuizing en uw verzekeringen | info
Is this expected behavior? I would expect the web crawler to follow these relative URLs as well. We also added the sitemap.xml URL to the engine, but that did not help.
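To make clear what I would expect, here is a minimal sketch of resolving a relative link against the page it was found on, using `urljoin` from Python's standard library. This is not the crawler's actual code. The `base_page` value is an assumption for illustration, because the log does not show which page contained the link. My guess (unconfirmed) is that the link is denied because it has no scheme until it is resolved.

```python
from urllib.parse import urljoin, urlsplit

# Assumed page on which the relative link was found (hypothetical; not in the log).
base_page = "https://www.unive.nl/klantenservice"

# Relative link exactly as it appears in url.full of the denied log entry.
relative_link = "/klantenservice/verhuizing"

# Taken as-is, the link has no scheme at all (empty string).
unresolved_scheme = urlsplit(relative_link).scheme

# Resolved against the base page, it becomes an absolute https URL.
resolved = urljoin(base_page, relative_link)
resolved_scheme = urlsplit(resolved).scheme

print(repr(unresolved_scheme))
print(resolved)
```

With the assumed base page, the resolved link is an ordinary absolute URL on our own domain, which is what I would expect the crawler to queue.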
Thanks.
Kind regards,
Mark