Use NTLM authentication while crawling domains

Disha_Bodade · March 13, 2023, 6:10am

Hi Team,
I am trying to crawl a website that uses NTLM authentication, but I am not able to crawl it. I can't see any option in the UI to add authentication details for the website, and the crawl API offers only basic and raw as auth types.
It's not crawling any documents from our website.

Please suggest.

Thanks,
Disha

Chenhui_Wang · March 13, 2023, 7:04am

Hi Disha,

Unfortunately, there's no option to configure auth in the UI; you have to do this via the API:
https://www.elastic.co/guide/en/app-search/current/web-crawler-reference.html#web-crawler-reference-http-authentication

You can update the domain with auth type raw, and the value will be used directly in the Authorization header (NTLM is one of the supported authentication schemes).
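A minimal sketch of what such an update request could look like, written in Python with only the standard library. The endpoint path, the `header` field name, and all argument values here are assumptions for illustration, so check them against the reference linked above before use. The function only builds the request; it does not send it.

```python
import json
import urllib.request


def build_domain_auth_request(base_url, engine, domain_id, api_key, auth_header):
    """Build (but do not send) a PUT request that sets raw auth on a crawler domain.

    Assumptions: the path layout below and the {"type": "raw", "header": ...}
    body shape follow the App Search web crawler reference; verify both there.
    """
    url = f"{base_url}/api/as/v1/engines/{engine}/crawler/domains/{domain_id}"
    body = {"auth": {"type": "raw", "header": auth_header}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            # App Search's own API key authenticates the API call itself.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values; sending would be urllib.request.urlopen(request).
request = build_domain_auth_request(
    "http://localhost:3002", "my-engine", "domain-id", "private-xxxx", "NTLM <token>"
)
```

Note that two different credentials are involved: the request's own Authorization header carries the App Search API key, while the `auth` object in the body is what the crawler would send to the website.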

Disha_Bodade · March 13, 2023, 7:28am

Hi Chenhui_Wang,
Do I need to configure the NTLM server as an HTTP proxy in enterprise_search.yml, like below?

crawler.http.proxy.host: auth.example.com
crawler.http.proxy.port: 443
crawler.http.proxy.protocol: https
crawler.http.proxy.username: username
crawler.http.proxy.password: password

Also, can we use both App Search authentication and website authorization in the same update domain API call?

Disha_Bodade · March 14, 2023, 1:27pm

Hi Team,
The crawler event logs show event.type: denied with the message below:
Unexpected content type for a crawl task with type=content

What does it mean?

video · March 20, 2023, 2:17pm

Hi @Disha_Bodade,

Unexpected content type for a crawl task with type=content

When the crawler logs "Unexpected content type", it means the crawler either doesn't support or couldn't recognize the Content-Type header of the response. Could you please share the URL if it can be accessed via the public internet, or run a basic curl command like this:

curl -i {denied_url}

and include the output.
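If curl is not available, roughly the same check can be sketched in Python with the standard library. This is only an illustration: the function names are hypothetical, and the probe deliberately does not follow redirects so that a 3xx response (for example, a redirect to a login page) is reported as-is rather than hidden.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; let the 3xx response surface as an HTTPError


def describe_response(status, headers):
    """Summarize the fields that matter when diagnosing a content-type denial."""
    content_type = headers.get("Content-Type") or ""
    location = headers.get("Location") or ""
    if 300 <= status < 400:
        note = f"redirect to {location or '(no Location header)'}"
    elif status >= 400:
        note = "HTTP error status"
    elif not content_type:
        note = "no Content-Type header"
    else:
        note = "Content-Type present"
    return {"status": status, "content_type": content_type, "note": note}


def probe(url, timeout=10):
    """Fetch url once, without following redirects, and describe the response."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return describe_response(resp.status, resp.headers)
    except urllib.error.HTTPError as err:
        # Unfollowed redirects and 4xx/5xx responses arrive here.
        return describe_response(err.code, err.headers)
```

Calling `probe("https://example.com/denied-page")` (a placeholder URL) returns the status code, the Content-Type value, and a short note.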

video · March 20, 2023, 2:21pm

You can use enterprise_search.yml; however, if you configure a proxy server this way, all of your crawlers will use those proxy settings.

If you want to use a proxy configuration per domain, you can use the Crawler API to add your configuration.

Disha_Bodade · April 7, 2023, 6:02pm

Our application team has enabled basic auth for the domain, but even now, when I try to crawl, it shows:

"fetch": {
    "timestamp": "Fri, 07 Apr 2023 16:04:15 +0000",
    "event_id": "64303effe4f766fe7de5b7ff",
    "message": "Unexpected content type  for a crawl task with type=content",
    "event_outcome": "failure",
    "duration_msec": 0.00596,
    "http_response": {
        "status_code": 302,
        "body_bytes": 0
    },
    "redirect": null
}

The message is again "Unexpected content type for a crawl task with type=content". I guess I have misconfigured something.

video · April 11, 2023, 9:53am

Have you configured your crawler with the basic auth credentials?
If you are using the Elastic web crawler, see "Managing crawls in Kibana" in the Elastic Enterprise Search documentation [8.7].

If you are using the App Search crawler, see "Web crawler reference" in the Elastic App Search documentation [8.7].

Could you please verify that your website is accessible via curl or similar tools using basic auth?
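As one way to run that check without curl, here is a small Python sketch that builds the same Authorization header that `curl -u username:password` would send. The function names and the example URL and credentials are placeholders, not values from this thread.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_basic_auth_request(url, username, password):
    """Build (but do not send) a GET request carrying basic auth credentials."""
    return urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(username, password)}
    )


# Placeholder values; send with urllib.request.urlopen(request) and inspect
# the status code and Content-Type header of the response.
request = build_basic_auth_request("https://example.com/", "user", "pass")
```

If the site answers with a 200 and an HTML Content-Type for this request, the credentials are accepted; a 401 or a redirect suggests the site handles authentication differently.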

Disha_Bodade · April 11, 2023, 10:09am

Hi Dimitrii,
It seems the issue was with how the domain was accepting authentication. Now that the application team has set up a proxy that takes care of authentication, I am able to crawl the domain's pages properly.

Thanks,
Disha
