Mapper attachment plugin vs. pre-parsing and extracting content from binary files

In order to search binary files, I have two options that I see:

  1. Use the mapper attachments plugin and have Elasticsearch handle everything.

  2. Pre-parse and extract the content from the binary files myself, then send the parsed content to Elasticsearch (a rough sketch of this is below).

Number 2 seems like a perfectly viable solution. What's the advantage of using the mapper plugin?
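
For context, here is a rough sketch of what I mean by option 2, using the tika-python bindings and the elasticsearch Python client. Neither library is required, this is just one way to do it; the file name, index name, and field names are placeholders, and the exact client call style varies a bit between client versions:

```python
# pip install tika elasticsearch
# (tika-python starts a local Apache Tika server under the hood and needs Java installed)
from tika import parser
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Step 1: extract text and metadata from the binary file with Tika.
parsed = parser.from_file("report.pdf")  # returns {"metadata": {...}, "content": "..."}

# Step 2: index only the extracted text (plus a little metadata) into Elasticsearch.
es.index(
    index="documents",
    body={
        "filename": "report.pdf",
        "content": (parsed.get("content") or "").strip(),
        "content_type": parsed["metadata"].get("Content-Type"),
    },
)
```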

The mapper-attachments plugin was removed in 6.0. The ingest-attachment plugin should be used instead.
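
For reference, a minimal sketch of the ingest-attachment route, assuming the plugin is installed on the nodes and using 7.x-style calls from the elasticsearch Python client (the pipeline name, index name, and file name are placeholders):

```python
import base64
from elasticsearch import Elasticsearch

# The plugin has to be installed on the Elasticsearch nodes first:
#   bin/elasticsearch-plugin install ingest-attachment
es = Elasticsearch("http://localhost:9200")

# Create a pipeline that runs the attachment processor on the "data" field.
es.ingest.put_pipeline(
    id="attachments",
    body={
        "description": "Extract text from binary files",
        "processors": [{"attachment": {"field": "data"}}],
    },
)

# The raw file has to travel to Elasticsearch as a base64 string.
with open("report.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

es.index(index="documents", pipeline="attachments", body={"data": encoded})
```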

But TBH I prefer your option 2. It's what I do in the FSCrawler project.

Can you elaborate on why you would choose to parse the documents yourself and not let the ingest-attachment plug-in do the work?

Mainly because of some jar conflicts (jarhell checks), we had to reduce the surface of what Tika can actually extract (the supported file types) inside the plugin.
So if you want full support for every file type Tika can handle, doing the extraction externally will help.

Also, some advanced features, like using Tesseract OCR, are not possible with the ingest-attachment plugin.
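
A rough sketch of that external route with OCR, assuming Tesseract is installed and on the PATH of the machine doing the parsing (the file name is a placeholder; Tika picks up the tesseract binary automatically when it is available):

```python
from tika import parser

# With the tesseract binary on the PATH, Tika runs OCR on image files automatically,
# so the same call used for PDFs also returns recognized text for scanned pages.
parsed = parser.from_file("scanned-page.png")
print((parsed.get("content") or "").strip())
```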

Awesome, thanks. Would love to hear from someone from Elastic on this.

lol

http://david.pilato.fr/blog/2017/01/09/4-years-at-elastic/

Sorry about that! Did not read your profile.

David, are there any advantages to the first approach?

No worries! It was funny to read 🙂

The main advantage is that you don't have to write and maintain that code yourself.

If you are using ingest-attachment instead of mapper-attachments (removed in 6.0), another advantage is that you can dedicate some nodes as ingest nodes and spread the extraction load across multiple nodes.

Awesome. Totally makes sense. Thanks.

Thanks for your input!
