Is it necessary to use Ingest Attachment Processor to index pdf files

Tim_Allison · September 26, 2018, 7:59pm

+1. Feel free to drop a note at user@tika.apache.org if you have questions or please open an issue on our JIRA (https://issues.apache.org/jira/projects/TIKA/summary) if you find problems. Cheers, Tim

rahulnama · October 10, 2018, 8:42am

Hi @Tim_Allison

Any idea on converting a doc file to PDF, or extracting content from a doc file page by page?

Please suggest

Thanks for your time always

-Rahul

Tim_Allison · October 10, 2018, 12:06pm

Doc and docx are, unfortunately, paragraph based, not page/coordinate based. Tika doesn’t calculate page breaks in doc/x. You might drop a note to the Apache POI user list or see what you can find via Google...sorry I can’t help.
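
To illustrate the "paragraph based" point, here is a minimal sketch (not Tika's or POI's method) that reads paragraphs straight out of a .docx using only the Python standard library. It relies on a .docx being a zip archive whose main body lives in `word/document.xml`, with paragraphs stored as `w:p` elements and text runs as `w:t`. The function names `docx_paragraphs` and `make_sample_docx` are made up for this example, and the sample archive is only a stand-in for a real Word file. This does not apply to the legacy binary .doc format.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file-like object).

    Hypothetical helper for illustration: the file stores paragraphs,
    not pages, so there is nothing page-shaped to iterate over.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join all text runs that belong to this paragraph
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def make_sample_docx(paragraph_texts):
    """Build a tiny in-memory archive with just word/document.xml.

    Demo only: a real .docx has more parts (content types, relationships),
    so Word itself would likely not open this file.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraph_texts
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    sample = make_sample_docx(["First paragraph", "Second paragraph"])
    for index, text in enumerate(docx_paragraphs(sample), start=1):
        print(index, text)
```

With a real file you would call `docx_paragraphs("report.docx")` (the path is a placeholder). Paragraphs inside tables are included too, since they are also `w:p` elements. Where each page ends is decided by the layout engine at render time, which is why page-by-page extraction needs a renderer rather than a parser.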

rahulnama · October 10, 2018, 12:27pm

Thanks @Tim_Allison

No problem.

How about reading the text from doc files and then converting them into PDFs? Is it possible? I tried with Python but couldn’t get it to work.

Tim_Allison · October 10, 2018, 1:37pm

This might be a lead, but I have no experience with it: https://stackoverflow.com/questions/50982064/converting-docx-to-pdf-with-pure-python-on-linux-without-libreoffice

rahulnama · October 10, 2018, 2:10pm

Thanks much for your time @Tim_Allison

Will work on it and will share the results

Thanks again

Tim_Allison · October 10, 2018, 2:40pm

+1 Definitely ask on the Apache POI user list...those folks know their MSOffice formats.

rahulnama · October 12, 2018, 4:40am

Sure @Tim_Allison

Thank you =D

system · November 9, 2018, 4:40am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
