Enterprise Search Crawler offers App Search users a powerful ingestion mechanism. Users can index any content available through an HTTP interface and build powerful search experiences without having to write any code to automate content ingestion. We at Elastic work tirelessly to ensure that Elastic Cloud is the best place for running all our solutions, including Enterprise Search.
Until recently, there was one specific use case the Elastic Crawler could not cover for Elastic Cloud users: crawling content hosted behind a firewall (within a private corporate network, a cloud VPC, or another isolated environment). Such customers could not use Elastic Cloud to run Enterprise Search, since their Crawler instances needed access to their private web resources. The recommended solution was to run a self-managed version of Enterprise Search (using ECE, ECK, or other options) on infrastructure with access to the private network containing the content.
Starting with version 7.16.1 of the Enterprise Search solution, users can use authenticated HTTP proxy servers for performing crawls. This enables a whole set of new use cases where the Crawler is hosted on Elastic Cloud but still has access to protected resources in private networks and other non-public environments. In this article, we walk through a typical configuration that allows you to crawl content hosted on a private network.
To demonstrate the authenticated proxy server functionality in Enterprise Search Crawler, we are going to use the following infrastructure. Although relatively simple, it provides us with all the components needed to understand the use case:
- We have a VPC (private network) deployed within Google Cloud
- The VPC contains two cloud instances:
- The web server hosts a private website that is only accessible from within the VPC, using the private domain name http://marlin-docs.internal.
- The proxy server has HTTP authentication set up (user: proxyuser, password: proxypass) and is accessible from the public internet using the proxy.acme.com host name and the port 3128.
- We have an Enterprise Search deployment running outside of the VPC, in Elastic Cloud.
- This deployment has no access to the private network hosting our content.
Please note: While TLS on the proxy server is not an absolute requirement and we support both HTTP and HTTPS, we recommend spending time to configure TLS to ensure the safety of your content and proxy server credentials while they are being transferred over the internet. Refer to the documentation for your proxy server software for more details on how to set it up.
Before we start changing our Enterprise Search deployment configuration to use the HTTP proxy described above, let's make sure the proxy actually works and allows us access to the private website.
One option to perform the test would be to configure your web browser to use the proxy and try accessing the private website. Alternatively, we can use the following command to fetch the home page from the site using a given proxy:
curl --head --proxy https://proxyuser:proxypass@proxy.acme.com:3128 http://marlin-docs.internal
This command performs a HEAD request to the website while using the proxy. The response should be an HTTP 200 with a set of additional headers. Here is an example response:
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 42337
Accept-Ranges: bytes
Server: nginx/1.14.2
Date: Tue, 30 Nov 2021 19:19:14 GMT
Last-Modified: Tue, 30 Nov 2021 17:57:39 GMT
ETag: "61a66613-a561"
Age: 4
X-Cache: HIT from oleksiy-blog-proxy
X-Cache-Lookup: HIT from oleksiy-blog-proxy:3128
Via: 1.1 oleksiy-blog-proxy (squid/4.6)
Connection: keep-alive
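Note that the curl command above embeds the credentials directly in the proxy URL. If your real username or password contains special characters (such as `:`, `@`, or `/`), they need to be percent-encoded before being placed into the URL. The small helper below (a hypothetical convenience function, not part of any Elastic tooling) shows one way to build such a URL safely in Python:

```python
from urllib.parse import quote


def proxy_url(host, port, user, password, scheme="https"):
    # Percent-encode the credentials so characters like ':' or '@' in a
    # real password don't break the URL structure. The demo values used
    # in this article contain no special characters, so they pass through
    # unchanged.
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


print(proxy_url("proxy.acme.com", 3128, "proxyuser", "proxypass"))
# prints: https://proxyuser:proxypass@proxy.acme.com:3128
```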
Now that we know our proxy credentials and connection parameters are correct, we can proceed to changing the Enterprise Search configuration.
To prepare our Enterprise Search deployment for using the HTTP proxy for all Crawler operations, we need to add the following custom settings to its configuration file:
crawler.http.proxy.host: proxy.acme.com
crawler.http.proxy.port: 3128
crawler.http.proxy.protocol: https
crawler.http.proxy.username: proxyuser
crawler.http.proxy.password: proxypass
After adding this configuration, the deployment will perform a graceful restart. You can find detailed instructions on how to work with custom configurations in our official Cloud documentation.
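Since a typo in any of the five settings means a restart followed by failed crawls, it can be worth sanity-checking the block before applying it. The script below is a convenience sketch (not part of Enterprise Search) that validates the flat `key: value` settings shown above:

```python
# Quick sanity check of the custom proxy settings before applying them
# to the deployment (a convenience sketch, not part of Enterprise Search).
REQUIRED = {
    "crawler.http.proxy.host",
    "crawler.http.proxy.port",
    "crawler.http.proxy.protocol",
    "crawler.http.proxy.username",
    "crawler.http.proxy.password",
}


def check_proxy_settings(settings_text):
    """Return a list of problems found in the flat 'key: value' settings."""
    settings = {}
    for line in settings_text.strip().splitlines():
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    problems = [f"missing {k}" for k in sorted(REQUIRED - settings.keys())]
    if settings.get("crawler.http.proxy.protocol") not in ("http", "https"):
        problems.append("protocol must be http or https")
    if not settings.get("crawler.http.proxy.port", "").isdigit():
        problems.append("port must be numeric")
    return problems


example = """\
crawler.http.proxy.host: proxy.acme.com
crawler.http.proxy.port: 3128
crawler.http.proxy.protocol: https
crawler.http.proxy.username: proxyuser
crawler.http.proxy.password: proxypass
"""
print(check_proxy_settings(example))  # prints: []
```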
Now that everything is in place, we can create an App Search engine and use the Web Crawler feature to ingest content into the engine:
Note: When a proxy is configured, the validation process used for adding a domain to the Crawler configuration will skip a number of networking-related checks (since those do not work through a proxy) and will display a warning to that effect. If you do not see the warning, check your deployment configuration to make sure the proxy settings have been applied correctly.
After adding our private domain to the configuration, we can start the crawl and should see the content being ingested into the deployment. If you see any failures at this stage, we recommend you check your crawler logs and proxy server's logs (specific to your proxy server of choice) for any clues on what might be going wrong.
Here is how the proxy logs should look if you're using a Squid proxy server:
1638298043.202      1 18.104.22.168 TCP_MISS/200 65694 GET http://marlin-docs.internal/docs/gcode/M951.html proxyuser HIER_DIRECT/10.188.0.2 text/html
1638298045.286      1 22.214.171.124 TCP_MISS/200 64730 GET http://marlin-docs.internal/docs/gcode/M997.html proxyuser HIER_DIRECT/10.188.0.2 text/html
1638298045.373      1 126.96.36.199 TCP_MISS/200 63609 GET http://marlin-docs.internal/docs/gcode/M999.html proxyuser HIER_DIRECT/10.188.0.2 text/html
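When a crawl covers thousands of pages, eyeballing the access log stops being practical. The sketch below parses log lines in the shape shown above (Squid's default native format) and extracts the URL, HTTP status, and authenticated user from each entry, so you can quickly spot non-200 responses or requests that arrived without the expected proxy user. The field layout is an assumption based on the example lines; adjust the pattern if your Squid logformat differs.

```python
import re

# Matches log lines in Squid's default native access.log format:
# timestamp elapsed client result/status bytes method url user hierarchy type
LOG_LINE = re.compile(
    r"^(?P<ts>\d+\.\d+)\s+\d+\s+\S+\s+(?P<result>\S+)/(?P<status>\d{3})\s+"
    r"\d+\s+(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<user>\S+)"
)


def summarize(lines):
    """Return (url, status, user) tuples for each line that parses."""
    out = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            out.append((m.group("url"), int(m.group("status")), m.group("user")))
    return out


sample = (
    "1638298043.202      1 18.104.22.168 TCP_MISS/200 65694 GET "
    "http://marlin-docs.internal/docs/gcode/M951.html proxyuser "
    "HIER_DIRECT/10.188.0.2 text/html"
)
print(summarize([sample]))
# prints: [('http://marlin-docs.internal/docs/gcode/M951.html', 200, 'proxyuser')]
```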