Use the App Search web crawler (beta) behind a corporate proxy

Hi,

We run ECE on premises in our data center and have deployed an Enterprise Search App Search engine. We want to use the web crawler functionality, but connections to public internet pages have to go through a corporate proxy.

I could not find anything about this in the documentation or on this forum.

Error message in logs: "Allow none because robots.txt responded with status 599".

How can we configure the web crawler to use a proxy for internet connectivity?
Thanks.

Kind regards,
Mark

Hi, maybe I did not explain it well enough.
We want to crawl websites like https://www.unive.nl, but to reach that website we have to go via the corporate proxy in our datacenter.
I think the web crawler is not aware of that proxy and tries to resolve www.unive.nl directly, which fails in our datacenter.
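
For what it's worth, the difference between a direct request and a proxied one can be reproduced outside the crawler with a small check like the one below. This is only a sketch: the proxy address is a placeholder (not our real one), the helper names are made up for illustration, and it says nothing about how the crawler itself makes its requests.

```python
import urllib.error
import urllib.request

# Hypothetical values: replace with your own proxy and target site.
PROXY_URL = "http://proxy.example.internal:3128"
TARGET = "https://www.unive.nl/robots.txt"


def make_opener(proxy_url=None):
    """Return an opener that goes direct (proxy_url=None) or via the proxy."""
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else {}
    # An empty dict makes ProxyHandler ignore any *_proxy environment variables,
    # so the "direct" case really is direct.
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def probe(url, proxy_url=None, timeout=10):
    """Fetch url and return the HTTP status, or a short error description."""
    try:
        with make_opener(proxy_url).open(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return f"failed: {exc}"


# Example (performs real network requests):
#   print("direct:   ", probe(TARGET))
#   print("via proxy:", probe(TARGET, PROXY_URL))
```

If the direct call fails while the proxied call returns an HTTP status, that would match what I suspect the crawler is running into.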

My question is: is it possible to configure the web crawler so it uses the proxy to go outside the datacenter?

Thanks.

Cheers,
Mark

Hello Mark,

Thank you for trying out the Crawler for your project! Unfortunately, there is no support for running behind a proxy yet. I'll add it to our roadmap, since I suspect other customers will need this mode of operation as well.

In the meantime, there are some options, but they are completely unsupported by Elastic. If you run Enterprise Search outside of ECE (you need more control over the environment around the product to apply these kinds of solutions), you may be able to coerce it into using the proxy by applying socksify or a similar OS-level TCP connection routing mechanism (see the socksify(1) Linux man page). Alternatively, many proxy servers support transparent proxying (see https://wiki.squid-cache.org/Features/Tproxy4 for an example), but implementing it requires a really deep understanding of Linux firewalls and the like.

I hope this helps.

Thanks for putting it on the roadmap!

In general, I think Elastic should support proxies by default in all their products. A lot of enterprise companies have to run behind a proxy. But that's just my humble opinion.

Thanks for the alternatives; unfortunately, those are a no-go for us.

Kind regards,
Mark
