Mapper attachment plugin vs. pre-parsing and extracting content from binary files


(Taylor Lovett) #1

In order to search binary files, I have two options that I see:

  1. Use the mapper attachment plugin and have Elasticsearch handle everything.

  2. Pre-parse and extract the content from the binary file, then send the parsed content to Elasticsearch.

Number 2 seems like a perfectly viable solution. What's the advantage of using the mapper plugin?


(David Pilato) #2

The mapper attachment plugin is removed in 6.0. The ingest attachment plugin should be used instead.

But TBH I prefer your solution number 2. It's what I do in the FSCrawler project.


(David Pocivalnik) #3

Can you elaborate on why you would choose to parse the documents yourself and not let the ingest-attachment plug-in do the work?


(David Pilato) #4

Mainly because of some jar conflicts (jarhell checks), we had to reduce the set of file types that Tika can actually extract inside the plugin.
So if you want full support for every file type Tika supports, doing the extraction externally will help.

Also, some advanced features such as Tesseract OCR are not possible with the ingest-attachment plugin.
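
For illustration, the external approach (option 2) might look roughly like the sketch below. The field names `filename` and `content`, the `build_document` helper, and the `plain_text_extractor` stand-in are all assumptions made for this example; in practice the extractor would be whatever tool you run outside Elasticsearch (Tika, possibly with OCR), and the resulting JSON body would be sent with your usual client.

```python
import json
from typing import Callable


def build_document(path: str, raw: bytes, extract_text: Callable[[bytes], str]) -> dict:
    """Run an external extractor and return the JSON document to index.

    The field names below are hypothetical; use whatever your mapping defines.
    """
    return {
        "filename": path,
        "content": extract_text(raw),
    }


def plain_text_extractor(raw: bytes) -> str:
    # Stand-in for a real extractor (e.g. Tika running outside Elasticsearch).
    # It only decodes UTF-8 text, which is enough to show the flow.
    return raw.decode("utf-8", errors="replace")


doc = build_document("notes.txt", b"hello from a file", plain_text_extractor)
body = json.dumps(doc)  # request body for an ordinary index request
```

Because Elasticsearch only ever receives plain text here, the set of supported file types is decided entirely by the extractor you plug in.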


(Taylor Lovett) #5

Awesome, thanks. Would love to hear from someone from Elastic on this.


(David Pilato) #6

lol

http://david.pilato.fr/blog/2017/01/09/4-years-at-elastic/


(Taylor Lovett) #7

Sorry about that! Did not read your profile.


(Taylor Lovett) #8

David, are there any advantages to the first approach?


(David Pilato) #9

No worries! Was funny to read 🙂


(David Pilato) #10

The main advantage is that you don't write/maintain the code.

If you are using ingest-attachment instead of mapper-attachments (removed in 6.0), another advantage is that you can dedicate some nodes as ingest nodes and spread the load across multiple nodes.
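
As a rough sketch of what the ingest-attachment route involves: you define an ingest pipeline with an `attachment` processor, then index documents whose source field holds the file as base64. The pipeline id `attachment`, the field name `data`, and the index name `my_index` below are assumptions for the example, and the code only builds the request bodies rather than sending them.

```python
import base64
import json

PIPELINE_ID = "attachment"  # hypothetical pipeline id

# Body for: PUT /_ingest/pipeline/attachment
pipeline_body = {
    "description": "Extract attachment information",
    "processors": [
        {"attachment": {"field": "data"}}  # "data" is an assumed field name
    ],
}


def build_source(raw: bytes) -> dict:
    """Return a document whose 'data' field is the base64-encoded file."""
    return {"data": base64.b64encode(raw).decode("ascii")}


source = build_source(b"hello from a file")

# (method, path, body) tuples to hand to whichever HTTP client you use.
pipeline_request = ("PUT", f"/_ingest/pipeline/{PIPELINE_ID}", json.dumps(pipeline_body))
index_request = ("PUT", f"/my_index/_doc/1?pipeline={PIPELINE_ID}", json.dumps(source))
```

The extraction then runs on whichever nodes handle ingest, which is what makes it possible to dedicate nodes to that work.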


(Taylor Lovett) #11

Awesome. Totally makes sense. Thanks.


(David Pocivalnik) #12

thanks for your input!

