Elastic Open Web Crawler

Hello,

I am trying out the Open Web Crawler in preparation for whenever it replaces the existing Enterprise Search crawler and its associated functionality.

My environment uses a single-node Elasticsearch, Kibana, and Enterprise Search for the current Enterprise Search crawler, all on v8.17.2 in a single VM. It works as expected.

I am setting up the Open Crawler in the same VM but running into an issue when trying to start a crawl.

Starting a crawl will kick out the following error:

elastic crawler]# docker exec -it crawler bin/crawler crawl config/crawler.yml
[crawl:67ee7d1ae16dda5d61cce017] [primary] Initialized an in-memory URL queue for up to 10000 URLs
[crawl:67ee7d1ae16dda5d61cce017] [primary] ES connections will be authorized with configured API key
[crawl:67ee7d1ae16dda5d61cce017] [primary] ES connections will use SSL without ca_fingerprint
The client is unable to verify that the server is Elasticsearch. Some functionality may not be compatible if the server is running an unsupported product.
[crawl:67ee7d1ae16dda5d61cce017] [primary] Failed to reach ES at https://localhost:9200

If I run the following from the crawler container to see if I can reach the Elasticsearch port, it seems ok:

elastic crawler]# docker exec -it crawler sh
/app $ nc -zv 192.168.50.96 9200
192.168.50.96 (192.168.50.96:9200) open

If I run the same from the VM command line, it also responds:

elastic crawler]# nc -zv 192.168.50.96 9200
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.50.96:9200.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

If I run either of these with 'localhost' instead of the IP, I still get what looks like a good response:

/app $ nc -zv localhost 9200
localhost ([::1]:9200) open

elastic crawler]# nc -zv localhost 9200
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to ::1:9200.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

I have crawler.yml configured as below:

elasticsearch:
  host: https://localhost
  port: 9200
  api_key: [redacted]

output_sink: elasticsearch
output_index: my-search-testindex

domains:
  - url: https://contoso.com

I do have the Elasticsearch host set up with security. In the current Enterprise Search config I have references to...

elasticsearch.ssl.enabled: true
elasticsearch.ssl.certificate_authority: /usr/share/enterprise-search/http_ca.crt

I figure I am missing something in documentation or elsewhere. :slight_smile: Any pointers are appreciated!

I think you are missing the certificate_authority.

I'd add:

elasticsearch.ca_fingerprint: /usr/share/enterprise-search/http_ca.crt

Hey, David. Appreciate the response.

I gave that a shot. To make the cert location visible inside the container, I mounted the enterprise-search path where the Elasticsearch cert is located:

docker-compose.yaml

services:
  crawler:
    image: docker.elastic.co/integrations/crawler:${CRAWLER_VERSION:-latest}
    container_name: crawler
    volumes:
      - /opt/crawler/config:/app/config
      - /usr/share/enterprise-search:/app/certs

Then I applied the config in crawler.yml:

elasticsearch:
  host: https://localhost
  port: 9200
  api_key: [redacted]
  ca_fingerprint: /app/certs/http_ca.crt

But after trying to start the crawl, it came back with the same:

The client is unable to verify that the server is Elasticsearch. Some functionality may not be compatible if the server is running an unsupported product.
[crawl:67eeaaefe16dda45e6951adf] [primary] Failed to reach ES at https://localhost:9200

I managed to get the crawler to work after some configuration changes, mostly to the API key permissions.

I definitely had an incorrect reference for 'ca_fingerprint'. It shouldn't be a direct reference to the cert file but the actual fingerprint, as the name states. That got me further along.
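In case it helps anyone else, the fingerprint itself can be pulled out of the CA cert with openssl (the path is from my setup, and I'd double-check the crawler docs on the exact format it expects, e.g. whether to strip the colons):

openssl x509 -fingerprint -sha256 -noout -in /usr/share/enterprise-search/http_ca.crt
# prints something like: sha256 Fingerprint=AA:BB:CC:...

# crawler.yml then gets that hex string rather than a file path, e.g.
# elasticsearch:
#   ca_fingerprint: AABBCC...   (placeholder value)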

Then I was still hitting an error, but slightly different:

[crawl:67f02b2ce16dda4d2a6e37cd] [primary] ES connections will use SSL with ca_fingerprint
The client is unable to verify that the server is Elasticsearch due to security privileges on the server side. Some functionality may not be compatible if the server is running an unsupported product.
[crawl:67f02b2ce16dda4d2a6e37cd] [primary] Failed to reach ES at https://localhost:9200

Progress, but another speed bump.

Then I tried to test with curl, referencing the API key:
curl -v --cacert /path/to/elasticsearch/http_ca.crt -H "Authorization: ApiKey [redacted]" https://localhost:9200

... and hit an error, fortunately a bit more telling of what was happening there:

{"error":{"root_cause":[{"type":"security_exception","reason":"action [cluster:monitor/main] is unauthorized for API key id [44fu9pUB4gFCq9OBqgcc] of user [elastic], this action is granted by the cluster privileges [monitor,manage,all]"}],"type":"security_exception","reason":"action [cluster:monitor/main] is unauthorized for API key id [44fu9pUB4gFCq9OBqgcc] of user [elastic], this action is granted by the cluster privileges [monitor,manage,all]"},"status":403}

Presumably the crawler does some sort of initial touch to make sure who it is talking to and to read cluster metadata, e.g. version and capabilities. The permission to do this was not assigned to the API key, as noted in the error output. I then added the cluster-level privilege that grants it: 'monitor'.
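If you manage keys from the command line, the update API key endpoint is one way to add the cluster privilege. A rough sketch (the key id is the one from the error above, the index name is from my crawler.yml, and the index privileges shown are just whatever your key already has; note the role_descriptors you send replace the existing ones, and this endpoint wants the owner's credentials rather than API key auth):

curl -X PUT --cacert /path/to/elasticsearch/http_ca.crt -u elastic \
  -H "Content-Type: application/json" \
  https://localhost:9200/_security/api_key/44fu9pUB4gFCq9OBqgcc \
  -d '{
    "role_descriptors": {
      "contoso_crawler_writer": {
        "cluster": [ "monitor" ],
        "indices": [
          { "names": [ "my-search-testindex" ], "privileges": [ "read", "write", "create_index" ] }
        ]
      }
    }
  }'

The full set of privileges I ended up with is at the bottom of this post.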

The curl test then came back OK:

curl -v --cacert /path/to/elasticsearch/http_ca.crt -H "Authorization: ApiKey [redacted]" https://localhost:9200
...
* Server certificate:
*  subject: CN=elastic
*  start date: Apr  2 13:38:01 2025 GMT
*  expire date: Apr  2 13:38:01 2027 GMT
*  subjectAltName: host "localhost" matched cert's "localhost"
*  issuer: CN=Elasticsearch security auto-configuration HTTP CA
*  SSL certificate verify ok.
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET / HTTP/1.1
> Host: localhost:9200
> User-Agent: curl/7.61.1
> Accept: */*
> Authorization: ApiKey [redacted]
>
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/1.1 200 OK
< X-elastic-product: Elasticsearch
< content-type: application/json
< content-length: 532
<
{
  "name" : "elastic",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "8.17.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_date" : "2025-02-05T22:10:57.067596412Z",
    "build_snapshot" : false,
    "lucene_version" : "9.12.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}
* Connection #0 to host localhost left intact

Tried a crawl and hit a new error:
[crawl:67f030fce16dda43f90ac246] [primary] Connected to ES at https://elastic:9200 - version: 8.17.2; build flavor: default
Elastic::Transport::Transport::Errors::Forbidden: [403]

Seemed like a similar check, this time whether the target index exists, and the API key didn't have permission for that either. A little peruse through the security privileges turned up 'view_index_metadata', so I added that to the API key privs.
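A quick way to reproduce that check outside the crawler is an index-exists request, which as far as I can tell is the kind of call that needs 'view_index_metadata':

curl -I --cacert /path/to/elasticsearch/http_ca.crt \
  -H "Authorization: ApiKey [redacted]" \
  https://localhost:9200/my-search-testindex
# 200 = index exists, 404 = not there yet, 403 = the key is still missing the privilege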

Ran another crawl and it started! :grinning_face: Then it hit an error :confused: at the end during some sort of cleanup action(s):
action [indices:admin/refresh] is unauthorized

Added another index privilege to the API key based on the 'refresh' hint in the output: 'maintenance'.
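The action name maps to the index refresh API, so (assuming my read of it is right) a quick sanity check after adding 'maintenance' is something like:

curl -X POST --cacert /path/to/elasticsearch/http_ca.crt \
  -H "Authorization: ApiKey [redacted]" \
  https://localhost:9200/my-search-testindex/_refresh
# should return a _shards summary instead of a 403 once the privilege is in place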

Kicked off another crawl and it completed with zero errors. The ending API key config looks like so:


  "contoso_crawler_writer": {
    "cluster": [
      "monitor"
    ],
    "indices": [
      {
        "names": [
          "test-search-testindex"
        ],
        "privileges": [
          "read",
          "write",
          "create",
          "create_index",
          "view_index_metadata",
          "maintenance"
        ],
        "allow_restricted_indices": false
      }
    ],
    "applications": [],
    "run_as": [],
    "metadata": {},
    "transient_metadata": {
      "enabled": true
    }
  }
}
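If anyone wants to create a key shaped like this in one go instead of iterating like I did, the create API key endpoint takes the same role descriptor. A sketch with the names from my setup:

curl -X POST --cacert /path/to/elasticsearch/http_ca.crt -u elastic \
  -H "Content-Type: application/json" \
  https://localhost:9200/_security/api_key \
  -d '{
    "name": "contoso_crawler_writer",
    "role_descriptors": {
      "contoso_crawler_writer": {
        "cluster": [ "monitor" ],
        "indices": [
          {
            "names": [ "test-search-testindex" ],
            "privileges": [ "read", "write", "create", "create_index", "view_index_metadata", "maintenance" ]
          }
        ]
      }
    }
  }'

The 'encoded' value in the response is what goes into the api_key setting in crawler.yml.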

Now it's on to the other crawler configs and customization testing!
