PDF documents specified in the sitemap are not being indexed by the web crawler

vsachdeva · June 14, 2024, 12:23am

Hello,
I am trying to index a set of PDF documents using the web crawler. The deployment is on GCP, and the PDF documents are listed in the sitemap, which is referenced in the robots.txt file. I am not using the workspace solution. Do I need to define an attachment processor in the ingest pipeline? Thanks
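
Before changing the pipeline, it may help to rule out the sitemap itself. The following is only a minimal sketch, not part of the crawler: it takes the text of a robots.txt file and of a sitemap, and lists the sitemap URLs and the PDF entries a crawler would be expected to discover. The function names and the example.com URLs are hypothetical, and it assumes a plain urlset sitemap (not a sitemap index).

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls_from_robots(robots_txt):
    """Return the URLs named by 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        match = re.match(r"(?i)sitemap\s*:\s*(\S+)", line.strip())
        if match:
            urls.append(match.group(1))
    return urls


def pdf_urls_from_sitemap(sitemap_xml):
    """Return the <loc> entries of a urlset sitemap whose path ends in .pdf."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        el.text.strip()
        for el in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if el.text
    ]
    # Ignore any query string when checking the extension.
    return [u for u in locs if u.lower().split("?", 1)[0].endswith(".pdf")]


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    robots = "User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n"
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/docs/guide.pdf</loc></url>"
        "<url><loc>https://example.com/index.html</loc></url>"
        "</urlset>"
    )
    print(sitemap_urls_from_robots(robots))
    print(pdf_urls_from_sitemap(sitemap))
```

If the PDF URLs show up here, the sitemap is probably fine and the problem is more likely in how the crawler handles non-HTML content.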

vsachdeva · June 14, 2024, 3:58am

The log explorer shows the following message for each PDF document in the sitemap:
Unexpected content type application/pdf for a crawl task with type=content
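
To see how many documents are affected, the exported log lines can be tallied by rejected content type. This is a rough sketch that matches only the message text quoted above. Anything else on the line (timestamps, URLs, field layout) is not assumed, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the message text quoted in this thread; the rest of the log line is ignored.
MESSAGE = re.compile(
    r"Unexpected content type (\S+) for a crawl task with type=(\w+)"
)


def count_rejected_content_types(log_lines):
    """Count how often each content type appears in 'Unexpected content type' messages."""
    counts = Counter()
    for line in log_lines:
        match = MESSAGE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Unexpected content type application/pdf for a crawl task with type=content",
        "some unrelated log line",
    ]
    print(count_rejected_content_types(sample))
```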

vsachdeva · June 14, 2024, 4:01am

I believe the issue should be resolved as per the documentation specified here:

I will update the web crawler configuration and provide an update.
