Is it inefficient to index PDF files in Elasticsearch?


(Aniket Kulkarni) #1

I want to index PDF files into Elasticsearch. I saw that there is a plugin available for that task.

https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html

But I want to know: is indexing PDF content inefficient compared to indexing plain text?
(Assuming the PDF content is already available as text.)

What are the cons of indexing PDFs with the above plugin or Apache Tika, compared to indexing text?


(David Pilato) #2

If you already have the text, just index the text and don't use ingest-attachment.

But if the question is "should I use this plugin or Tika?", then you have to know that this plugin uses Tika (actually a subset of Tika).

For the FSCrawler project, I chose to call Tika myself.
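If you do go the plugin route, the setup boils down to an ingest pipeline containing an `attachment` processor. A minimal sketch of the pipeline body (the pipeline id, the source field name `data`, and the follow-up `remove` processor are illustrative choices, not anything prescribed by the plugin):

```python
import json

# Ingest pipeline using the attachment processor from the
# ingest-attachment plugin (which wraps a subset of Apache Tika).
# Field name "data" is just a convention for the base64 payload.
pipeline = {
    "description": "Extract text from base64-encoded PDFs",
    "processors": [
        {"attachment": {"field": "data"}},
        # Drop the raw base64 once the text has been extracted,
        # so it is not kept in _source.
        {"remove": {"field": "data"}},
    ],
}

# This is the body you would PUT to /_ingest/pipeline/<pipeline-id>
print(json.dumps(pipeline, indent=2))
```

Documents sent through this pipeline end up with the extracted text under an `attachment.content` field, while the original base64 blob is discarded by the `remove` processor.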


(Aniket Kulkarni) #3

@dadoonet Thank you for your reply. There is an XML file available which contains the text, but in order to get the text I need to apply transforms on that XML. That is doable, but it's an extra step in the process. And I also have the same text available in PDF format.
So which would be more efficient? Is the pain of extracting the text from the XML worth it over indexing the already available PDF using the plugin?


(David Pilato) #4

The more you do before indexing into Elasticsearch, the better.

So if you can provide the text directly, that's less work for Elasticsearch to do, and less memory consumed.


(Aniket Kulkarni) #5

@dadoonet Sure, then we will consider text. But just for deeper understanding, what do you mean by "less memory consumed"? While creating the mapping, we can exclude the PDF content field from being stored in _source, right? (As mentioned here: https://qbox.io/blog/index-attachments-files-elasticsearch-mapper.) Would memory consumption still be lower?


(David Pilato) #6

When you send a PDF doc as BASE64, elasticsearch has to:

  • Receive the JSON document and keep it in memory
  • Extract the BASE64 String to memory
  • Process it with Tika and create a String out of it (kept in RAM)
  • Update the JSON document with that content.

Which means that you consume much more memory than when doing only the last step.
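On top of those in-memory copies, the BASE64 encoding itself inflates the payload by about a third before any Tika processing even starts. A quick sketch of that overhead (the 3 MB payload size is just an example):

```python
import base64

# Simulate a 3 MB binary PDF payload.
pdf_bytes = b"\x00" * 3_000_000

# Base64 maps every 3 input bytes to 4 output characters, so the
# JSON document Elasticsearch receives is ~33% larger than the PDF.
encoded = base64.b64encode(pdf_bytes)
overhead = len(encoded) / len(pdf_bytes)
print(f"base64 size: {len(encoded)} bytes ({overhead:.0%} of original)")
# → base64 size: 4000000 bytes (133% of original)
```

And that inflated payload is only the first of the copies listed above; the decoded bytes and the extracted text String each add their own share of heap.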


(Aniket Kulkarni) #7

@dadoonet Sure. Will there be any significant performance impact when searching or querying back the indexed PDF documents? And what if the PDF documents are as big as 300 or 500 MB? Will it still work?


(David Pilato) #8

In the end, whether you use ingest (attachment + remove processors) or not, you will have a document which contains:

{
  "content": "your text here"
}

The search will be the same regardless of how the document was ingested.
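In other words, a full-text query against that field looks identical either way. A minimal sketch, assuming the extracted text ends up in a field named `content` as in the document above:

```python
import json

# The same match query works whether the text was extracted
# client-side or by an ingest-attachment pipeline, because the
# indexed document ends up with a plain "content" field either way.
query = {"query": {"match": {"content": "your text here"}}}

# This is the body you would POST to /<index>/_search
print(json.dumps(query))
```

The ingestion path only affects how much work happens at index time; the inverted index that the query runs against is the same.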


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.