Hello everyone!
Some websites, such as http://www.gilacountyaz.gov/government/assessor/index.php, have many internal links that should be absolute paths but are missing the leading slash.
This causes the web crawler to resolve them relative to the current page's directory and generate wrong links. Instead of
https://www.gilacountyaz.gov/government/assessor/late_breaking_news.php
the web crawler creates
https://www.gilacountyaz.gov/government/assessor/government/assessor/late_breaking_news.php
This can potentially create an infinite loop and a lot of 404 errors.
Web browsers like Firefox or Chrome handle this correctly because the page contains a <base> tag:
<head>
<base href="http://www.gilacountyaz.gov/index.php"/>
</head>
This tag allows the browser to interpret these links correctly, but the web crawler ignores it. Is there a quick workaround that will make the web crawler resolve these links correctly?
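
For reference, here is a minimal sketch of the resolution a browser performs with the <base> tag, written in Python using only the standard library. The names BaseAwareLinkParser and resolve_links, and the sample anchor href, are assumptions made for illustration; they are not part of any particular crawler. A crawler that lets you override link extraction could probably apply the same idea.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAwareLinkParser(HTMLParser):
    """Collect <a href> values and the first <base href>, if present."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only the first <base> with an href counts, as in browsers.
        if tag == "base" and self.base_href is None and attrs.get("href"):
            self.base_href = attrs["href"]
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def resolve_links(page_url, html):
    """Resolve every link against <base href> if set, else the page URL."""
    parser = BaseAwareLinkParser()
    parser.feed(html)
    # The base href may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base, href) for href in parser.hrefs]


# Hypothetical sample page modelled on the site above (the anchor is assumed).
page_url = "http://www.gilacountyaz.gov/government/assessor/index.php"
html = (
    '<head><base href="http://www.gilacountyaz.gov/index.php"/></head>'
    '<body><a href="government/assessor/late_breaking_news.php">News</a></body>'
)
print(resolve_links(page_url, html))
# ['http://www.gilacountyaz.gov/government/assessor/late_breaking_news.php']
```

Without the <base> tag, the same function falls back to the page URL and produces the doubled government/assessor/government/assessor/ path described above.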