PDFs are not getting indexed in App Search when crawled using the App Search Web Crawler

Hi Team,
I have some PDF documents in my domain sitemap, and while crawling them, the App Search web crawler logs show this message:

[2023-04-13T08:11:53.831+00:00][844759][49052][crawler][WARN]: [crawl:6437b871e4f7669ec8fc6df1] [primary] Processed crawl results from the page 'https://example/en/documents/example.pdf?t=0' via the app_search output. Outcome: failure. Message: Failed to index the document into App Search: errors=["Unable to save document"]

I think App Search indices are automatically aligned with the app_search_crawler ingest pipeline. I still applied it as the default pipeline (roughly as sketched below) and tried to crawl, but the issue persists.
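For reference, this is roughly how I set it as the default pipeline. The index name here is only a placeholder for the engine's backing index, so substitute the actual index for your engine:

 PUT .ent-search-engine-documents-my-engine/_settings
 {
   "index": {
     "default_pipeline": "app_search_crawler"
   }
 }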

I also added the settings below in enterprise-search.yml:

crawler.content_extraction.enabled: true
crawler.content_extraction.mime_types: ["application/pdf", "application/msword", "text/plain", "application/xml", "text/html", "text/css"]

Does anyone have any idea if I am missing something in the configuration?

Thanks,
Disha

Hi @Disha_Bodade ,

Your configurations look right to me. The error message that you shared is letting us know that something went wrong somewhere between the crawler and actually indexing the document, but a lot happens during that window. Is there anything else in the logs around that line? Looking at the code, I'd expect there to be a full stack trace which should tell us exactly where the error occurred.

Hi @Sean_Story ,
I have enabled the debug log level for the App Search crawler, and below are the debug logs while fetching the PDF.

[2023-04-13T13:57:10.658+00:00][897454][28972][crawler][DEBUG]: [crawl:643809eee4f76630354bca22] [primary] Crawl task progress: <CrawlTask: url=https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract.pdf?t=0, type=content, depth=1, redirect_count=0, auth=none>: processing result

[2023-04-13T13:57:10.659+00:00][897454][28972][crawler][DEBUG]: [crawl:643809eee4f76630354bca22] [primary] Crawl task progress: <CrawlTask: url=https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract.pdf?t=0, type=content, depth=1, redirect_count=0, auth=none>: extracting links

[2023-04-13T13:57:10.659+00:00][897454][28972][crawler][DEBUG]: [crawl:643809eee4f76630354bca22] [primary] Crawl task progress: <CrawlTask: url=https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract-for.pdf?t=0, type=content, depth=1, redirect_count=0, auth=none>: ingesting the result

[2023-04-13T13:57:10.864+00:00][897454][28972][crawler][INFO]: [app_search] [engine:6437e2a0e4f76648fb4ba27a] [crawl:643809eee4f76630354bca22] [primary] Indexing a document into App Search: id=6437e395e4f766d58d4ba709, url=https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract.pdf?t=0
[2023-04-13T13:57:10.956+00:00][897454][28972][crawler][WARN]: [crawl:643809eee4f76630354bca22] [primary] Processed crawl results from the page 'https://www.example.com/en/documents/state-of-louisiana-naspo-valuepoint-contract.pdf?t=0' via the app_search output. Outcome: failure. Message: Failed to index the document into App Search: errors=["Unable to save document"].

I don't see any full stack trace after enabling debug mode.

Thanks,
Disha

Hi Disha,

Are you filtering only log messages that have [crawler] or looking only at crawler.log? I think the stack trace would actually go through the app-server logger (logs/app-server.log) due to how our exception handling pipeline works.

@Sean_Story ,
Thanks for your suggestion to check app-server.log; that resolved it.
I had added a date processor on one of the date fields, but that field was not present in the PDF documents' source, so indexing was failing for them.

After making the processor conditional in the ingest pipeline (see the sketch below), it started working for both PDFs and other pages.
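For anyone hitting the same problem, the conditional processor looks roughly like this. The field name last_updated and the date format are placeholders from my setup, not something the crawler produces by default:

 "processors" : [
    {
      "date" : {
        "description": "example only: field name and format are placeholders",
        "if": "ctx.last_updated != null",
        "field": "last_updated",
        "formats": ["yyyy-MM-dd"]
      }
    }
  ]

With the "if" condition in place, the processor is simply skipped for documents (like the PDFs) that don't carry the field, instead of failing the whole indexing request.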

Thanks,
Disha


@Disha_Bodade I'm interested in your experience in customizing the pipeline for this use case. Would you be interested in giving the product team some feedback?

@Serena_Chou,
It would be great if the attachment processor were able to generate the exported fields ["content", "title", "date", "modified", "modifier"], as suggested by the Elasticsearch documentation.

The properties field of the processor is shown below:

 "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "content", "title", "date", "modified", "modifier"],
        "remove_binary": false
      }
    }
  ]
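A quick way to verify which of those fields actually come back is the simulate API. This is only a sketch; the base64 payload is the tiny RTF sample from the Elasticsearch attachment processor docs, so substitute a real PDF payload to test your own documents:

 POST _ingest/pipeline/_simulate
 {
   "pipeline": {
     "processors" : [
       {
         "attachment" : {
           "field" : "data",
           "properties": [ "content", "title", "date", "modified", "modifier"],
           "remove_binary": false
         }
       }
     ]
   },
   "docs": [
     {
       "_source": {
         "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
       }
     }
   ]
 }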

Thanks,
Disha
