Indexing a compressed string blob in Elasticsearch

I need to index a compressed string blob.
I read that the ingest attachment plugin could be the right tool for files and base64-encoded strings.
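
For reference, this is roughly how I was planning to use it. A sketch against an assumed local, unsecured cluster; the index, pipeline, and file names are placeholders:

```python
import base64

import requests

ES = "http://localhost:9200"  # assumed local cluster

# Create an ingest pipeline that runs the attachment processor on the
# base64-encoded "data" field.
requests.put(
    f"{ES}/_ingest/pipeline/attachment",
    json={
        "description": "Extract text from base64-encoded binary content",
        "processors": [{"attachment": {"field": "data"}}],
    },
).raise_for_status()

# Index a document through that pipeline; the plugin expects the raw
# bytes base64-encoded, not the bytes themselves.
with open("report.pdf", "rb") as f:  # hypothetical input file
    encoded = base64.b64encode(f.read()).decode("ascii")

requests.put(
    f"{ES}/my-index/_doc/1",  # "my-index" is a placeholder
    params={"pipeline": "attachment"},
    json={"data": encoded},
).raise_for_status()
```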

The ingest attachment plugin relies on Apache Tika.

I read that Tika supports:

"Tar, AR, ARJ, CPIO, Dump, Zip, 7Zip, Gzip, BZip2, XZ, LZMA, Z and Pack200"

in Apache Tika – Supported Document Formats, under the section "Compression and packaging formats".

But as I understand it, Elasticsearch cannot use this Tika feature: we cannot index compressed formats directly, and need to uncompress the content before storing the documents or strings.
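
So my plan would have to look something like the sketch below: decompress on the client first, then hand only the uncompressed bytes to the pipeline. Assuming gzip and the pipeline name from my earlier snippet:

```python
import base64
import gzip

import requests

ES = "http://localhost:9200"  # assumed local cluster

# Decompress the blob first; the plugin does not expose Tika's
# compression support, so only the uncompressed bytes go in as base64.
with open("report.pdf.gz", "rb") as f:  # hypothetical compressed blob
    raw = gzip.decompress(f.read())

requests.put(
    f"{ES}/my-index/_doc/2",
    params={"pipeline": "attachment"},
    json={"data": base64.b64encode(raw).decode("ascii")},
).raise_for_status()
```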

You can give FSCrawler a try; it supports all the formats that Tika supports.

Got it. I understand I would need to use this to send data to Elasticsearch. Is there nothing I can do about the compressed blob content that has already been uploaded?

If you can use the ingest attachment plugin, then you could use the reindex API to read the existing documents and reindex them into another index.
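
Something along these lines, where the index names are placeholders and the pipeline on the destination runs the attachment processor on every copied document:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Copy every document from the existing index into a new one, running
# each one through the attachment pipeline on the way in.
requests.post(
    f"{ES}/_reindex",
    json={
        "source": {"index": "old-index"},
        "dest": {"index": "new-index", "pipeline": "attachment"},
    },
).raise_for_status()
```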

The ingest attachment plugin isn't able to read compressed files :frowning:

You mentioned in this earlier post that you are using a potentially unusual compression format. As far as I know, Tika and the ingest attachment plugin only handle the more common binary formats, so I would probably recommend extracting the content to be indexed from your blobs on the client side if you want to search it using Elasticsearch.
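
For example, if you did move to gzip, a minimal client-side sketch would be (index and file names are illustrative):

```python
import gzip

import requests

ES = "http://localhost:9200"  # assumed local cluster

# Decompress on the client and index the plain text directly. No ingest
# pipeline is needed, and any format your client can decompress works.
with open("document.txt.gz", "rb") as f:  # hypothetical gzipped text
    text = gzip.decompress(f.read()).decode("utf-8")

requests.put(f"{ES}/my-index/_doc/3", json={"content": text}).raise_for_status()
```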

Hey @Christian, I am open to changing the compression format going forward too, to a common one like gzip.

Note that the ingest attachment plugin extracts the text to be indexed from the binary content and stores it on the document, which will increase its size. I would therefore not expect any major efficiency gains from having the plugin do the extraction compared to doing it yourself before indexing the document.
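
You can see this by fetching a document indexed through the pipeline: its `_source` keeps the original base64 field and gains the extracted `attachment` object on top of it. A sketch against the earlier example (a `remove` processor in the pipeline can at least drop the base64 field after extraction):

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# The stored _source contains both the original base64 "data" field and
# the "attachment" object with the extracted text, so the document grows.
doc = requests.get(f"{ES}/my-index/_doc/1").json()
print(sorted(doc["_source"]))  # expect something like ['attachment', 'data']
```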
