jerrac (David Reagan) | April 19, 2021, 10:00pm | #1
When I try to run the web crawler against a site we host, it fails with this error:
Failed HTTP request: Unable to request "<domain>" because it resolved to only private/invalid addresses
The site in question resolves to a 10.n.n.n IP address. Is the crawler configured to reject that? Is there a way to override that behavior?
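For what it's worth, here's a minimal Python 3 sketch of the kind of check I'm describing (intranet.example.com is just a placeholder for our internal hostname, and the private-address logic is only my guess at what the crawler enforces, not its actual implementation):

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Resolve a hostname and return the set of IP address strings it maps to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def has_public_address(hostname):
    """Return True if at least one resolved address is globally routable."""
    for addr in resolved_addresses(hostname):
        ip = ipaddress.ip_address(addr)
        # is_global is False for private ranges (10/8, 172.16/12, 192.168/16),
        # loopback, link-local, and other special-purpose blocks.
        if ip.is_global:
            return True
    return False


if __name__ == "__main__":
    # Placeholder hostname: substitute the site you are trying to crawl.
    host = "intranet.example.com"
    try:
        print(host, "->", sorted(resolved_addresses(host)))
        print("has a public address:", has_public_address(host))
    except socket.gaierror as exc:
        print("DNS lookup failed:", exc)
```

If every address printed falls in a private range, that matches the error above.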
jerrac (David Reagan) | April 19, 2021, 10:12pm | #2
Not sure it's related, but if I target my personal site, which is not hosted internally, it fails as well.
In the logs I see:
Allow none because robots.txt responded with status 599
and
Failed HTTP request: Remote host terminated the handshake
That also happens if I target the Elastic Blog.
I double-checked my personal site's robots.txt file. It's the default Drupal 8 robots.txt file, so there shouldn't be anything in it that would completely block the crawler.
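If I understand it right, the 599 isn't a real response from my server but a status the crawler reports when the robots.txt request itself fails, so the terminated handshake is probably the root cause rather than anything in the file. A minimal Python 3 handshake check like the one below can help confirm whether TLS negotiation is the problem, independent of the crawler (www.example.com is a placeholder for the target site):

```python
import socket
import ssl


def try_tls_handshake(hostname, port=443, timeout=10):
    """Attempt a TLS handshake and return the negotiated protocol and cipher."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.version(), tls.cipher()


if __name__ == "__main__":
    # Placeholder hostname: substitute the site that fails to crawl.
    try:
        version, cipher = try_tls_handshake("www.example.com")
        print("Handshake OK:", version, cipher)
    except OSError as exc:  # includes ssl.SSLError
        print("Handshake failed:", exc)
```

If this succeeds locally but the crawler still can't connect, the mismatch is more likely between the TLS versions or ciphers the crawler offers and what the web server accepts.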
Anyway, I'm glad this is still beta.
orhantoy (Orhan Toy) | April 20, 2021, 9:02am | #3
Yes, that's the current default behavior, and it will become configurable in the next minor release.
As for the other issue you're experiencing, it sounds like you can't crawl any site at all, is that correct?
jerrac (David Reagan) | April 20, 2021, 3:29pm | #4
Nice.
Yep, I can't crawl my personal site or the Elastic Blog. I haven't tried any other sites yet.
system (system) | Closed | May 18, 2021, 3:30pm | #5
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.