PDFs are not getting indexed in App Search when crawled using the App Search Web Crawler

Hi Team,
I have some PDF documents in my domain sitemap, and while crawling them, the App Search web crawler logs show this message:

[2023-04-13T08:11:53.831+00:00][844759][49052][crawler][WARN]: [crawl:6437b871e4f7669ec8fc6df1] [primary] Processed crawl results from the page 'https://example/en/documents/example.pdf?t=0' via the app_search output. Outcome: failure. Message: Failed to index the document into App Search: errors=["Unable to save document"]

I think App Search indices are automatically aligned with the app_search_crawler ingest pipeline. I still applied it as the default pipeline (roughly as sketched below) and tried to crawl, but the issue persists.
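For reference, this is roughly how I set it as the default pipeline. The index name here is only a placeholder for the engine's backing index, so substitute the actual index for your engine:

 PUT .ent-search-engine-documents-my-engine/_settings
 {
   "index": {
     "default_pipeline": "app_search_crawler"
   }
 }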

I also added the settings below in enterprise-search.yml:

crawler.content_extraction.enabled: true
crawler.content_extraction.mime_types: ["application/pdf", "application/msword", "text/plain", "application/xml", "text/html", "text/css"]

Does anyone have any idea if I am missing something in the configuration?

Thanks,
Disha

Hi @Disha_Bodade ,

Your configurations look right to me. The error message that you shared is letting us know that something went wrong somewhere between the crawler and actually indexing the document, but a lot happens during that window. Is there anything else in the logs around that line? Looking at the code, I'd expect there to be a full stack trace which should tell us exactly where the error occurred.

Hi @Sean_Story ,
I have enabled the debug log level for the App Search crawler, and below are the debug logs while fetching the PDF.

[2023-04-13T13:57:10.658+00:00][897454][28972][crawler][DEBUG]: [crawl:643809eee4f76630354bca22] [primary] Crawl task progress: <CrawlTask: url=https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract.pdf?t=0, type=content, depth=1, redirect_count=0, auth=none>: processing result

[2023-04-13T13:57:10.659+00:00][897454][28972][crawler][DEBUG]: [crawl:643809eee4f76630354bca22] [primary] Crawl task progress: <CrawlTask: url=https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract.pdf?t=0, type=content, depth=1, redirect_count=0, auth=none>: extracting links

[2023-04-13T13:57:10.659+00:00][897454][28972][crawler][DEBUG]: [crawl:643809eee4f76630354bca22] [primary] Crawl task progress: <CrawlTask: url=https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract-for.pdf?t=0, type=content, depth=1, redirect_count=0, auth=none>: ingesting the result

[2023-04-13T13:57:10.864+00:00][897454][28972][crawler][INFO]: [app_search] [engine:6437e2a0e4f76648fb4ba27a] [crawl:643809eee4f76630354bca22] [primary] Indexing a document into App Search: id=6437e395e4f766d58d4ba709, url=https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract.pdf?t=0
[2023-04-13T13:57:10.956+00:00][897454][28972][crawler][WARN]: [crawl:643809eee4f76630354bca22] [primary] Processed crawl results from the page 'https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract.pdf?t=0' via the app_search output. Outcome: failure. Message: Failed to index the document into App Search: errors=["Unable to save document"].

I don't see any full stack trace after enabling debug mode.

Thanks,
Disha

Hi Disha,

Are you filtering only log messages that have [crawler] or looking only at crawler.log? I think the stack trace would actually go through the app-server logger (logs/app-server.log) due to how our exception handling pipeline works.

@Sean_Story ,
Thanks for your suggestion to check app-server.log; that resolved it.
I had added a date processor on one of the date fields, but that field was not present in the PDF documents' source, so indexing was failing for them.

After making the processor conditional in the ingest pipeline (see the sketch below), it started working for both PDFs and other pages.
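For anyone hitting the same problem, the conditional processor looks roughly like this. The field name last_updated and the date format are placeholders from my setup, not something the crawler produces by default:

 "processors" : [
    {
      "date" : {
        "description": "example only: field name and format are placeholders",
        "if": "ctx.last_updated != null",
        "field": "last_updated",
        "formats": ["yyyy-MM-dd"]
      }
    }
  ]

With the "if" condition in place, the processor is simply skipped for documents (like the PDFs) that don't carry the field, instead of failing the whole indexing request.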

Thanks,
Disha


@Disha_Bodade I'm interested in your experience in customizing the pipeline for this use case. Would you be interested in giving the product team some feedback?

@Serena_Chou,
It would be great if the attachment processor were able to generate the exported fields ["content", "title", "date", "modified", "modifier"], as suggested by the Elasticsearch documentation.

The properties field of the processor is shown below:

 "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "content", "title", "date", "modified", "modifier"],
        "remove_binary": false
      }
    }
  ]
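A quick way to verify which of those fields actually come back is the simulate API. This is only a sketch; the base64 payload is the tiny RTF sample from the Elasticsearch attachment processor docs, so substitute a real PDF payload to test your own documents:

 POST _ingest/pipeline/_simulate
 {
   "pipeline": {
     "processors" : [
       {
         "attachment" : {
           "field" : "data",
           "properties": [ "content", "title", "date", "modified", "modifier"],
           "remove_binary": false
         }
       }
     ]
   },
   "docs": [
     {
       "_source": {
         "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
       }
     }
   ]
 }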

Thanks,
Disha
