The Enterprise Search web crawler is logging the error below:
Allow none because robots.txt responded with status 599
Error: read_timeout
What could be the potential issue, and what is the suggested resolution?
I see the exceptions above while adding a domain. I am able to access the domain URL and its robots.txt file through a browser.
The issue happens for one specific public domain only; I am able to crawl other public websites.
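One difference between the browser test and the crawler is the HTTP proxy configured below: the crawler fetches robots.txt through the proxy, while a browser on your workstation may not. A quick way to reproduce the crawler's view is to run curl from the Enterprise Search host through the same proxy. This is a hedged sketch; proxy.internal:80 and example.com are placeholders for your actual proxy and the failing domain:

```
# Fetch robots.txt the same way the crawler does: from the EES host,
# through the configured HTTP proxy. Replace the placeholders.
curl -sv \
  --connect-timeout 10 \
  --max-time 30 \
  -x http://proxy.internal:80 \
  -o /dev/null \
  https://example.com/robots.txt
```

If this hangs or times out while a direct (no `-x`) fetch succeeds, the problem is between the proxy and that one origin (e.g. the site blocking the proxy's egress IP or a firewall rule), not the crawler itself.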
Below is the EES config:
allow_es_settings_modification: true
elasticsearch.host: https://xxxxxxxxxxxxx:9200
elasticsearch.ssl.enabled: true
elasticsearch.ssl.verify: false
kibana.host: http://xxxxxxxx:5601
ent_search.listen_host: 0.0.0.0
ent_search.listen_port: 3002
connector.crawler.http.proxy.host: xxxxxxxxxxxx
connector.crawler.http.proxy.port: 80
connector.crawler.http.proxy.protocol: http
connector.crawler.security.dns.allow_private_networks_access: true
connector.crawler.security.dns.allow_loopback_access: true
connector.crawler.content_extraction.enabled: true
connector.crawler.content_extraction.mime_types: ["application/pdf", "application/msword", "text/plain", "application/xml", "text/html", "text/css"]
crawler.http.proxy.host: xxxxxxxxxxxx
crawler.http.proxy.port: 80
crawler.http.proxy.protocol: http
crawler.security.dns.allow_loopback_access: true
crawler.security.dns.allow_private_networks_access: true
crawler.content_extraction.enabled: true
crawler.content_extraction.mime_types: ["application/pdf", "application/msword", "text/plain", "application/xml", "text/html", "text/css"]
ent_search.ssl.enabled: false
crawler.security.ssl.verification_mode: none
connector.crawler.security.ssl.verification_mode: none
crawler.http.request_timeout: 90
crawler.http.read_timeout: 30
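Note that a status of 599 is not returned by the remote site; the crawler substitutes it when the robots.txt request fails at the transport level (here, `read_timeout`), and then disallows everything for safety. If the origin is merely slow through the proxy, raising the crawler timeouts is worth an experiment. The fragment below is illustrative only, not a recommendation; `crawler.http.connection_timeout`, `crawler.http.read_timeout`, and `crawler.http.request_timeout` are existing Enterprise Search crawler settings, but the values are assumptions to tune for your environment:

```
# Illustrative values only -- tune for your environment.
crawler.http.connection_timeout: 30   # seconds to establish the connection
crawler.http.read_timeout: 120        # seconds to wait while reading the response
crawler.http.request_timeout: 300     # overall ceiling for a single request
```

Since the config above sets both `crawler.http.*` and `connector.crawler.http.*` prefixes, any change would presumably need to be mirrored under both.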