PDF documents specified in the sitemap are not being indexed by the web crawler

vsachdeva · June 14, 2024, 12:23am

Hello,
I am trying to index a set of PDF documents using the web crawler. The deployment runs on GCP, and the PDF documents are listed in the sitemap, which is referenced in the robots.txt file. I am not using the workspace solution. Do I need to define an attachment processor in the ingest pipeline? Thanks

vsachdeva · June 14, 2024, 3:58am

For each PDF document in the sitemap, the log explorer shows this message:
Unexpected content type application/pdf for a crawl task with type=content
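
In case it helps anyone hitting the same message, here is a rough Python sketch for checking which sitemap entries are PDFs and what Content-Type the server reports for them. It only confirms what the crawler is being served; it does not change crawler behaviour. The function names (`sitemap_urls`, `pdf_urls`, `content_type`) are hypothetical helpers, and it assumes a standard sitemaps.org `urlset` document.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files (assumed here).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def pdf_urls(urls: list[str]) -> list[str]:
    """Keep only URLs whose path (ignoring any query string) ends in .pdf."""
    return [u for u in urls if u.lower().split("?", 1)[0].endswith(".pdf")]


def content_type(url: str) -> str:
    """Send a HEAD request and return the Content-Type the server reports.

    Hypothetical diagnostic helper: needs network access and a server
    that answers HEAD requests.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get_content_type()
```

Calling `content_type(url)` on each result of `pdf_urls(sitemap_urls(xml_text))` should return `application/pdf` for the affected documents, matching the content type named in the log message.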

vsachdeva · June 14, 2024, 4:01am

I believe the issue should be resolved as per the documentation specified here:

I will update the web crawler configuration and provide an update.
