Relative URLs/paths with no leading slash

Hello everyone!
Some websites, like http://www.gilacountyaz.gov/government/assessor/index.php, have a bunch of internal links that are meant to resolve from the site root but do not have the leading slash.

This causes a web crawler to generate wrong links. Instead of

https://www.gilacountyaz.gov/government/assessor/late_breaking_news.php

the web crawler creates

https://www.gilacountyaz.gov/government/assessor/government/assessor/late_breaking_news.php

This can potentially create an infinite loop and a lot of 404 errors.

Web browsers like Firefox or Chrome can handle this because there is a <base> tag present on the website:


    <head>
      <base href="http://www.gilacountyaz.gov/index.php"/>
    </head>

It allows browsers to interpret these links correctly, but the web crawler ignores it. Is there any quick workaround that will make the web crawler work correctly?
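For anyone who wants to see the difference outside the crawler, here is a minimal Python sketch of how a <base> tag changes link resolution. It is not how the Elastic web crawler works internally; `LinkExtractor` is a hypothetical helper, and the href value `government/assessor/late_breaking_news.php` is my assumption about what the page markup contains, inferred from the doubled path above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects <a href> values and honours the first <base href>."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.base_url = page_url  # fall back to the page URL when there is no <base>
        self.base_seen = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and not self.base_seen and attrs.get("href"):
            # Only the first <base href> counts; it may itself be relative to the page.
            self.base_url = urljoin(self.page_url, attrs["href"])
            self.base_seen = True
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def resolved_links(self):
        # Resolve every collected href against the effective base URL.
        return [urljoin(self.base_url, href) for href in self.hrefs]


PAGE_URL = "http://www.gilacountyaz.gov/government/assessor/index.php"
LINK = '<a href="government/assessor/late_breaking_news.php">News</a>'  # assumed markup

# Without <base>: the href is resolved against the page's own directory.
without_base = LinkExtractor(PAGE_URL)
without_base.feed("<head></head><body>" + LINK + "</body>")

# With <base>: the href is resolved against the declared base URL instead.
with_base = LinkExtractor(PAGE_URL)
with_base.feed(
    '<head><base href="http://www.gilacountyaz.gov/index.php"/></head>'
    "<body>" + LINK + "</body>"
)
```

With these inputs, `without_base.resolved_links()` gives the doubled `/government/assessor/government/assessor/...` path, while `with_base.resolved_links()` gives `/government/assessor/late_breaking_news.php` on the host and scheme taken from the base URL.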

It looks like a bug in the web crawler beta release. I don't see a workaround, but I let the Elastic engineering team know about it. Thanks for reporting this!

Thank you very much for reporting this! I don't think there is a way to work around the problem since we simply don't support the base tag at the moment. We have added it to the roadmap and will make sure it is supported before GA.

@oleksiy-elastic @Rich_Kuzsma Thank you very much for your help!

Just wanted to close the loop here: the feature has been implemented and should be available in the next minor release of the solution (7.13).
