Enterprise Search Crawler offers App Search users a powerful ingestion mechanism. Users can index any content available through an HTTP interface and build powerful search experiences without having to write any code to automate content ingestion. We at Elastic work tirelessly to ensure that Elastic Cloud is the best place for running all our solutions, including Enterprise Search.
Until recently, there was one specific use case the Elastic Crawler could not cover for Elastic Cloud users: crawling content hosted behind a firewall (within a private corporate network, a cloud VPC, or another isolated environment). Such customers could not use Elastic Cloud to run Enterprise Search, since their Crawler instances needed access to their private web resources. The recommended solution was to run a self-managed version of Enterprise Search (using ECE, ECK, or other options) on infrastructure with access to the private network containing the content.
Starting with version 7.16.1 of the Enterprise Search solution, users can use authenticated HTTP proxy servers for performing crawls. This enables a whole set of new use cases where the Crawler is hosted on Elastic Cloud but still has access to protected resources in private networks and other non-public environments. In this article, we walk through a typical configuration that allows you to crawl content hosted on a private network.
To demonstrate the authenticated proxy server functionality in Enterprise Search Crawler, we are going to use the following infrastructure. Although relatively simple, it provides us with all the components needed to understand the use case:
- We have a VPC (private network) deployed within Google Cloud
- The VPC contains two cloud instances:
- The web server hosts a private website that is only accessible from within the VPC, using the private domain name http://marlin-docs.internal.
- The proxy server has HTTP authentication set up (user: proxyuser, password: proxypass) and is accessible from the public internet using the proxy.acme.com host name and the port 3128.
- We have an Enterprise Search deployment running outside of the VPC, in Elastic Cloud.
- This deployment has no access to the private network hosting our content.
Please note: While TLS on the proxy server is not an absolute requirement and we support both HTTP and HTTPS, we recommend spending time to configure TLS to ensure the safety of your content and proxy server credentials while they are being transferred over the internet. Refer to the documentation for your proxy server software for more details on how to set it up.
Before we start changing our Enterprise Search deployment configuration to use the HTTP proxy described above, let's make sure the proxy actually works and allows us access to the private website.
One option to perform the test would be to configure your web browser to use the proxy and try accessing the private website. Alternatively, we can use the following command to fetch the home page from the site using a given proxy:
curl --head --proxy https://proxyuser:proxypass@proxy.acme.com:3128 http://marlin-docs.internal
This command performs a HEAD request to the website while using the proxy. The response should be an HTTP 200 with a set of additional headers. Here is an example response:
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 42337
Accept-Ranges: bytes
Server: nginx/1.14.2
Date: Tue, 30 Nov 2021 19:19:14 GMT
Last-Modified: Tue, 30 Nov 2021 17:57:39 GMT
ETag: "61a66613-a561"
Age: 4
X-Cache: HIT from oleksiy-blog-proxy
X-Cache-Lookup: HIT from oleksiy-blog-proxy:3128
Via: 1.1 oleksiy-blog-proxy (squid/4.6)
Connection: keep-alive
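Note that the curl command above embeds the credentials directly in the proxy URL. If your real username or password contains special characters (such as `:`, `@`, or `/`), they need to be percent-encoded before being placed into the URL. The small helper below (a hypothetical convenience function, not part of any Elastic tooling) shows one way to build such a URL safely in Python:

```python
from urllib.parse import quote


def proxy_url(host, port, user, password, scheme="https"):
    # Percent-encode the credentials so characters like ':' or '@' in a
    # real password don't break the URL structure. The demo values used
    # in this article contain no special characters, so they pass through
    # unchanged.
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


print(proxy_url("proxy.acme.com", 3128, "proxyuser", "proxypass"))
# prints: https://proxyuser:proxypass@proxy.acme.com:3128
```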
Now that we know our proxy credentials and connection parameters are correct, we can proceed to changing the Enterprise Search configuration.
To prepare our Enterprise Search deployment for using the HTTP proxy for all Crawler operations, we need to add the following custom settings to its configuration file:
crawler.http.proxy.host: proxy.acme.com
crawler.http.proxy.port: 3128
crawler.http.proxy.protocol: https
crawler.http.proxy.username: proxyuser
crawler.http.proxy.password: proxypass
After adding this configuration, the deployment will perform a graceful restart. You can find detailed instructions on how to work with custom configurations in our official Cloud documentation.
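Since a typo in any of the five settings means a restart followed by failed crawls, it can be worth sanity-checking the block before applying it. The script below is a convenience sketch (not part of Enterprise Search) that validates the flat `key: value` settings shown above:

```python
# Quick sanity check of the custom proxy settings before applying them
# to the deployment (a convenience sketch, not part of Enterprise Search).
REQUIRED = {
    "crawler.http.proxy.host",
    "crawler.http.proxy.port",
    "crawler.http.proxy.protocol",
    "crawler.http.proxy.username",
    "crawler.http.proxy.password",
}


def check_proxy_settings(settings_text):
    """Return a list of problems found in the flat 'key: value' settings."""
    settings = {}
    for line in settings_text.strip().splitlines():
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    problems = [f"missing {k}" for k in sorted(REQUIRED - settings.keys())]
    if settings.get("crawler.http.proxy.protocol") not in ("http", "https"):
        problems.append("protocol must be http or https")
    if not settings.get("crawler.http.proxy.port", "").isdigit():
        problems.append("port must be numeric")
    return problems


example = """\
crawler.http.proxy.host: proxy.acme.com
crawler.http.proxy.port: 3128
crawler.http.proxy.protocol: https
crawler.http.proxy.username: proxyuser
crawler.http.proxy.password: proxypass
"""
print(check_proxy_settings(example))  # prints: []
```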
Now that everything is in place, we can create an App Search engine and use the Web Crawler feature to ingest content into the engine:
Note: When a proxy is configured, the validation process used for adding a domain to the Crawler configuration will skip a number of networking-related checks (since those do not work through a proxy) and will display a warning to that effect. If you do not see the warning, check your deployment configuration to make sure the proxy settings have been applied correctly.
After adding our private domain to the configuration, we can start the crawl and should see the content being ingested into the deployment. If you see any failures at this stage, we recommend you check your crawler logs and proxy server's logs (specific to your proxy server of choice) for any clues on what might be going wrong.
Here is how the proxy logs should look if you're using a Squid proxy server:
1638298043.202      1 18.104.22.168 TCP_MISS/200 65694 GET http://marlin-docs.internal/docs/gcode/M951.html proxyuser HIER_DIRECT/10.188.0.2 text/html
1638298045.286      1 22.214.171.124 TCP_MISS/200 64730 GET http://marlin-docs.internal/docs/gcode/M997.html proxyuser HIER_DIRECT/10.188.0.2 text/html
1638298045.373      1 126.96.36.199 TCP_MISS/200 63609 GET http://marlin-docs.internal/docs/gcode/M999.html proxyuser HIER_DIRECT/10.188.0.2 text/html
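When a crawl covers thousands of pages, eyeballing the access log stops being practical. The sketch below parses log lines in the shape shown above (Squid's default native format) and extracts the URL, HTTP status, and authenticated user from each entry, so you can quickly spot non-200 responses or requests that arrived without the expected proxy user. The field layout is an assumption based on the example lines; adjust the pattern if your Squid logformat differs.

```python
import re

# Matches log lines in Squid's default native access.log format:
# timestamp elapsed client result/status bytes method url user hierarchy type
LOG_LINE = re.compile(
    r"^(?P<ts>\d+\.\d+)\s+\d+\s+\S+\s+(?P<result>\S+)/(?P<status>\d{3})\s+"
    r"\d+\s+(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<user>\S+)"
)


def summarize(lines):
    """Return (url, status, user) tuples for each line that parses."""
    out = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            out.append((m.group("url"), int(m.group("status")), m.group("user")))
    return out


sample = (
    "1638298043.202      1 18.104.22.168 TCP_MISS/200 65694 GET "
    "http://marlin-docs.internal/docs/gcode/M951.html proxyuser "
    "HIER_DIRECT/10.188.0.2 text/html"
)
print(summarize([sample]))
# prints: [('http://marlin-docs.internal/docs/gcode/M951.html', 200, 'proxyuser')]
```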