Hi Team,
I have some PDF documents in my domain's sitemap, and while crawling them, the App Search web crawler logs show the following message:
[2023-04-13T08:11:53.831+00:00][844759][49052][crawler][WARN]: [crawl:6437b871e4f7669ec8fc6df1] [primary] Processed crawl results from the page 'https://example/en/documents/example.pdf?t=0' via the app_search output. Outcome: failure. Message: Failed to index the document into App Search: errors=["Unable to save document"]
I believe App Search indices are automatically associated with the app_search_crawler ingest pipeline. I still set it explicitly as the default pipeline and crawled again, but the issue persists.
I also added the settings below in enterprise-search.config
Your configurations look right to me. The error message that you shared is letting us know that something went wrong somewhere between the crawler and actually indexing the document, but a lot happens during that window. Is there anything else in the logs around that line? Looking at the code, I'd expect there to be a full stack trace which should tell us exactly where the error occurred.
Are you filtering only log messages that have [crawler] or looking only at crawler.log? I think the stack trace would actually go through the app-server logger (logs/app-server.log) due to how our exception handling pipeline works.
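If it helps, here is a minimal sketch for pulling the error line plus the lines that follow it (where a stack trace would normally sit) out of a log file. The helper name `lines_with_context` and the 20-line window are assumptions for illustration, not part of Enterprise Search; adjust the search string and path to match your deployment.

```python
from typing import Iterable, List


def lines_with_context(lines: Iterable[str], needle: str, after: int = 20) -> List[str]:
    """Return every line containing `needle`, plus up to `after` lines following it."""
    out: List[str] = []
    remaining = 0  # how many more lines to emit after the latest match
    for line in lines:
        if needle in line:
            # Restart the window on every match so overlapping hits are kept.
            remaining = after + 1
        if remaining > 0:
            out.append(line.rstrip("\n"))
            remaining -= 1
    return out


# Example usage (path taken from the suggestion above; yours may differ):
# with open("logs/app-server.log", encoding="utf-8", errors="replace") as fh:
#     for entry in lines_with_context(fh, "Unable to save document"):
#         print(entry)
```

Searching for the error text rather than the `[crawler]` tag avoids filtering out entries written by a different logger.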
@Sean_Story ,
Thanks for the suggestion to check app-server.log. The issue is resolved.
I had added a date processor on one of the date fields, but that field is not present in the source of the PDF documents, so indexing failed for them.
After making the processor conditional in the ingest pipeline, indexing works for both PDFs and other pages.
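For anyone hitting the same problem, a rough sketch of what such a conditional date processor could look like, built here as a Python dict and printed as the JSON body you would put in the pipeline definition. The field name `published_date` and the `ISO8601` format are assumptions for illustration (the actual field and format were not shared in this thread); the `if`, `field`, `formats`, and `target_field` options are standard for the Elasticsearch date processor.

```python
import json

# Hypothetical field name; replace with the date field your pipeline parses.
DATE_FIELD = "published_date"


def build_conditional_date_processor(field: str) -> dict:
    """Build a date processor that only runs when `field` exists on the document."""
    return {
        "date": {
            # Painless condition: skip documents (e.g. PDFs) that lack the field.
            "if": f"ctx.{field} != null",
            "field": field,
            # Assumed input format; change to match your data.
            "formats": ["ISO8601"],
            # Write the parsed value back to the same field instead of the
            # processor's default target, @timestamp.
            "target_field": field,
        }
    }


pipeline_body = {"processors": [build_conditional_date_processor(DATE_FIELD)]}
print(json.dumps(pipeline_body, indent=2))
```

Guarding with `if` keeps the processor strict for documents that do carry the field, whereas `ignore_failure` would also hide genuinely malformed dates.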
@Disha_Bodade I'm interested in your experience in customizing the pipeline for this use case. Would you be interested in giving the product team some feedback?
@Serena_Chou,
It would be great if the attachment processor could generate the exported fields ["content", "title", "date", "modified", "modifier"], as described in the Elasticsearch documentation.