Is it necessary to use the Ingest Attachment Processor to index PDF files?

+1. Feel free to drop a note at user@tika.apache.org if you have questions or please open an issue on our JIRA (https://issues.apache.org/jira/projects/TIKA/summary) if you find problems. Cheers, Tim

Hi @Tim_Allison

Any ideas on converting a doc file to PDF, or on extracting content from a doc file page by page?

Please suggest

Thanks for your time always :slight_smile:

-Rahul

Doc and docx are, unfortunately, paragraph based, not page/coordinate based. Tika doesn’t calculate page breaks in doc/x. You might drop a note to the Apache POI user list or see what you can find via Google...sorry I can’t help.
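To illustrate the paragraph-based point, here is a minimal sketch for the .docx case only (the older binary .doc format is not covered). It relies on the fact that a .docx file is a zip archive whose main body lives in `word/document.xml` as a sequence of `w:p` paragraph elements. The function name `docx_paragraphs` is hypothetical and this is not how Tika or POI do it; it uses only the Python standard library.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file, in document order.

    `source` may be a path or a binary file-like object.
    Hypothetical helper for illustration; it ignores headers, footers,
    footnotes, and anything else stored outside word/document.xml.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # Each w:p element is one paragraph; its text is split across w:t runs.
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

What comes back is a flat list of paragraphs with no page information attached, which is why page-by-page extraction would need a layout engine to compute where the page breaks fall.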

Thanks @Tim_Allison

No problem.

How about reading the text from doc files and then converting them into PDFs? Is it possible? I tried with Python but couldn’t get it to work.

This might be a lead, but I have no experience with it: https://stackoverflow.com/questions/50982064/converting-docx-to-pdf-with-pure-python-on-linux-without-libreoffice

Thanks much for your time @Tim_Allison

Will work on it and will share the results :slight_smile:

Thanks again

+1 Definitely ask on the Apache POI user list...those folks know their MSOffice formats. :smiley:

Sure @Tim_Allison

Thank you =D
