Relative URLs/paths with no leading slash

krzyszto9 · February 25, 2021, 10:13pm

Hello everyone!
Some websites like http://www.gilacountyaz.gov/government/assessor/index.php have a bunch of internal links that should be absolute paths, but do not have the leading slash.

This causes the web crawler to generate wrong links. Instead of

https://www.gilacountyaz.gov/government/assessor/late_breaking_news.php

the web crawler creates

https://www.gilacountyaz.gov/government/assessor/government/assessor/late_breaking_news.php

This can potentially create an infinite loop and a lot of 404 errors.

Web browsers like Firefox or Chrome can handle this, because there is a <base> tag present on the website.


    <head>
      <base href="http://www.gilacountyaz.gov/index.php"/>
    </head>

It allows the browser to interpret these links correctly, but the web crawler ignores it. Is there any quick workaround that will make the web crawler work correctly?
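
To illustrate the difference, here is a minimal Python sketch using the standard library's `urllib.parse.urljoin`. It only shows how the same relative link resolves with and without the `<base>` href; it is not how the Elastic web crawler is implemented, and the helper name `resolve_link` is made up for this example.

```python
from urllib.parse import urljoin

def resolve_link(page_url, link, base_href=None):
    """Resolve a link the way a browser would (hypothetical helper).

    If the page declares <base href="...">, that href (itself resolved
    against the page URL) becomes the base; otherwise the page URL is used.
    """
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, link)

page_url = "http://www.gilacountyaz.gov/government/assessor/index.php"
base_href = "http://www.gilacountyaz.gov/index.php"
link = "government/assessor/late_breaking_news.php"  # no leading slash

# Ignoring <base>: the link is resolved against the page's own directory,
# so the path segments are duplicated.
without_base = resolve_link(page_url, link)

# Honoring <base>: the link is resolved against the site root.
with_base = resolve_link(page_url, link, base_href)

print(without_base)
print(with_base)
```

The first result reproduces the doubled `government/assessor/government/assessor/` path, and the second gives the link a browser would follow.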

Rich_Kuzsma · February 25, 2021, 10:40pm

It looks like a bug in the web crawler beta release. I don't see a workaround, but I let the Elastic engineering team know about it. Thanks for reporting this!

oleksiy-elastic · February 28, 2021, 7:13pm

Thank you very much for reporting this! I don't think there is a way to work around the problem since we simply don't support the base tag at the moment. We have added it to the roadmap and will make sure it is supported before GA.

krzyszto9 · February 28, 2021, 8:56pm

@oleksiy-elastic @Rich_Kuzsma Thank you very much for your help!

oleksiy-elastic · March 5, 2021, 2:10pm

Just wanted to close the loop here: the feature has been implemented and should be available in the next minor release of the solution (7.13).

system · April 2, 2021, 2:11pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
