Is it inefficient to index PDF files in Elasticsearch?


(Aniket Kulkarni) #1

I want to index PDF files into Elasticsearch. I saw that there is a plugin available for that task.

https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html

But I want to know: is indexing PDF content inefficient compared to indexing plain text?
(Assuming the PDF content is already available as text.)

What are the cons of indexing PDFs with the above plugin or Apache Tika, compared to indexing text?


(David Pilato) #2

If you already have the text, just index the text and don't use ingest-attachment.

But if the question is "should I use this plugin or Tika?", then you have to know that this plugin uses Tika (actually a subset of Tika).

For the FSCrawler project, I chose to call Tika myself.
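If you do go the plugin route, the setup boils down to an ingest pipeline containing an `attachment` processor. A minimal sketch of the pipeline body (the pipeline id, the source field name `data`, and the follow-up `remove` processor are illustrative choices, not anything prescribed by the plugin):

```python
import json

# Ingest pipeline using the attachment processor from the
# ingest-attachment plugin (which wraps a subset of Apache Tika).
# Field name "data" is just a convention for the base64 payload.
pipeline = {
    "description": "Extract text from base64-encoded PDFs",
    "processors": [
        {"attachment": {"field": "data"}},
        # Drop the raw base64 once the text has been extracted,
        # so it is not kept in _source.
        {"remove": {"field": "data"}},
    ],
}

# This is the body you would PUT to /_ingest/pipeline/<pipeline-id>
print(json.dumps(pipeline, indent=2))
```

Documents sent through this pipeline end up with the extracted text under an `attachment.content` field, while the original base64 blob is discarded by the `remove` processor.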


(Aniket Kulkarni) #3

@dadoonet Thank you for your reply. There is an XML file available which contains the text, but in order to get the text I need to apply transforms on that XML. That is doable, but it's an extra step in the process. And I also have the same text available in PDF format.
So which would be more efficient? Is the pain of extracting the text from the XML worth it over indexing the already available PDF using the plugin?


(David Pilato) #4

The more you do before indexing into Elasticsearch, the better.

So if you can provide the text directly, that's less work for Elasticsearch to do, and less memory consumed.


(Aniket Kulkarni) #5

@dadoonet Sure, then we will consider text. But just for deeper understanding, what do you mean by "less memory consumed"? While creating the mapping, we can exclude the PDF content field from being stored in _source, right? (As mentioned here: https://qbox.io/blog/index-attachments-files-elasticsearch-mapper.) Would memory consumption still be lower?


(David Pilato) #6

When you send a PDF doc as BASE64, elasticsearch has to:

  • Receive the JSON document and keep it in memory
  • Extract the BASE64 String to memory
  • Process it with Tika and create a String out of it (kept in RAM)
  • Update the JSON document with that content.

Which means that you consume much more memory than when doing only the last step.
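On top of those in-memory copies, the BASE64 encoding itself inflates the payload by about a third before any Tika processing even starts. A quick sketch of that overhead (the 3 MB payload size is just an example):

```python
import base64

# Simulate a 3 MB binary PDF payload.
pdf_bytes = b"\x00" * 3_000_000

# Base64 maps every 3 input bytes to 4 output characters, so the
# JSON document Elasticsearch receives is ~33% larger than the PDF.
encoded = base64.b64encode(pdf_bytes)
overhead = len(encoded) / len(pdf_bytes)
print(f"base64 size: {len(encoded)} bytes ({overhead:.0%} of original)")
# → base64 size: 4000000 bytes (133% of original)
```

And that inflated payload is only the first of the copies listed above; the decoded bytes and the extracted text String each add their own share of heap.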


(Aniket Kulkarni) #7

@dadoonet Sure. Will there be any significant performance impact when searching or querying back the indexed PDF documents? And what if the PDF documents are as big as 300 or 500 MB? Will it still work?


(David Pilato) #8

In the end, whether you use ingest (attachment + remove processors) or not, you will have a document which contains:

{
  "content": "your text here"
}

The search will be the same regardless of how the document was ingested.
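In other words, a full-text query against that field looks identical either way. A minimal sketch, assuming the extracted text ends up in a field named `content` as in the document above:

```python
import json

# The same match query works whether the text was extracted
# client-side or by an ingest-attachment pipeline, because the
# indexed document ends up with a plain "content" field either way.
query = {"query": {"match": {"content": "your text here"}}}

# This is the body you would POST to /<index>/_search
print(json.dumps(query))
```

The ingestion path only affects how much work happens at index time; the inverted index that the query runs against is the same.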


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.