Add large attachments from S3 to Elasticsearch

Hi,

I have large PDF files stored in AWS S3 that need to be indexed and made searchable. I want to upload these files to Elasticsearch, with each file indexed as a single document.
How can I do this from my Node.js application?

I have seen Filebeat, Logstash, and FSCrawler; or can I use the Bulk API?

Any help would be really great. I am new to Elasticsearch, so I'm having a tough time with this.

Thanks,
Regs

I haven't tried it before, but it looks like you need to install the Ingest Attachment Processor plugin in Elasticsearch first:

Then use the attachment processor in your ingest pipeline to process the file. The PDF must be base64-encoded before it is passed to Elasticsearch.
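Something like this, roughly, with the official Node.js client (a minimal, untested sketch assuming the 7.x `@elastic/elasticsearch` client; the index name `documents` and the pipeline id `pdf-attachment` are just placeholders):

```js
const fs = require('fs');
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

async function run() {
  // Create an ingest pipeline that runs the attachment processor on the "data" field
  await client.ingest.putPipeline({
    id: 'pdf-attachment',
    body: {
      description: 'Extract text from base64-encoded PDFs',
      processors: [{ attachment: { field: 'data' } }],
    },
  });

  // Read a local PDF and index it as a single document through the pipeline
  const data = fs.readFileSync('./example.pdf').toString('base64');
  await client.index({
    index: 'documents',
    pipeline: 'pdf-attachment',
    body: { data },
  });
}

run().catch(console.error);
```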

@erictung I have already installed the Ingest Attachment Processor and used it to extract text from small PDF files that I uploaded to Elasticsearch.

I am not sure how to upload large PDF files (greater than 25 MB) to Elasticsearch. What is the best way to do it?

  1. Is there a connector I can use to upload the encoded data as a stream?
  2. Should I download the S3 document into my Node.js application and then send it to Elasticsearch with a single PUT/POST?

FSCrawler does not yet support reading data from S3. A feature request is open for this:

If you are using Node.js, you can probably read the binary from S3 and then send it to Elasticsearch. The "problem" is that this requires a lot of memory on the Elasticsearch side, as it has to receive the JSON content, decode the binary from BASE64, send it to Tika, create the final field, and index all of that. So for 25 MB of data, that will require a lot of memory IMO.
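For illustration, here is a rough sketch of that approach (assuming the AWS SDK v3 `@aws-sdk/client-s3` package and the 7.x `@elastic/elasticsearch` client; the bucket, key, index, and pipeline names are placeholders). Note that the whole file is held in memory in the Node.js process before it is sent:

```js
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const { Client } = require('@elastic/elasticsearch');

const s3 = new S3Client({ region: 'us-east-1' });
const es = new Client({ node: 'http://localhost:9200' });

// Collect a readable stream into a single Buffer
function streamToBuffer(stream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    stream.on('data', (chunk) => chunks.push(chunk));
    stream.on('error', reject);
    stream.on('end', () => resolve(Buffer.concat(chunks)));
  });
}

async function indexPdfFromS3(bucket, key) {
  // Download the whole object from S3
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const buffer = await streamToBuffer(Body);

  // Base64-encode it and send it through the attachment pipeline as one document
  await es.index({
    index: 'documents',
    pipeline: 'pdf-attachment',
    body: {
      filename: key,
      data: buffer.toString('base64'),
    },
  });
}

indexPdfFromS3('my-bucket', 'reports/big-report.pdf').catch(console.error);
```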

ingest-attachment is perfect for small files; I would not use it for big files.
FWIW, in that case I'd use dedicated ingest nodes that don't hold any data. That way, if a node is overloaded, it can die without any impact on the data...

You can run FSCrawler as a REST service. This might work, and at least it won't put pressure on the Elasticsearch data nodes.
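As a rough idea of what that could look like from Node.js (an untested sketch; the `form-data` package is assumed, and the host, port, and `_upload` endpoint shown are assumptions, so check the FSCrawler documentation for your version):

```js
const fs = require('fs');
const FormData = require('form-data'); // npm install form-data

// Assumed local FSCrawler REST service; verify the host, port, and endpoint
// against the FSCrawler documentation for your version.
const FSCRAWLER_URL = 'http://127.0.0.1:8080/fscrawler/_upload';

function uploadToFscrawler(path) {
  const form = new FormData();
  form.append('file', fs.createReadStream(path));

  // Send the file as multipart/form-data; FSCrawler extracts the text
  // and writes the resulting document to Elasticsearch itself
  form.submit(FSCRAWLER_URL, (err, res) => {
    if (err) return console.error(err);
    console.log('FSCrawler responded with status', res.statusCode);
    res.resume(); // drain the response
  });
}

uploadToFscrawler('./example.pdf');
```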

In the future, if the S3 input for FSCrawler becomes available, it could solve a use case like yours.

That made me think of another user story I have been asked about for a while. I just described it here:

I think it would make sense to implement something like that. I need to think about it.

Thanks for the info! What do you think about Filebeat, Logstash, and the Bulk API? Can these be used?

Filebeat and Logstash are not built for that.

The Bulk API is useful if you build your own solution. FSCrawler uses the Bulk API to write documents to Elasticsearch.
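For example, with the 7.x Node.js client a minimal Bulk API call could look like this (untested sketch; the `documents` index and the document fields are placeholders):

```js
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

async function bulkIndex(docs) {
  // The bulk body alternates action metadata lines and document source lines
  const body = docs.flatMap((doc) => [{ index: { _index: 'documents' } }, doc]);

  const { body: response } = await client.bulk({ body, refresh: true });
  if (response.errors) {
    // Inspect the per-item results to see which documents failed
    const failed = response.items.filter((item) => item.index && item.index.error);
    console.error('Some documents failed:', failed);
  }
}

bulkIndex([
  { filename: 'a.pdf', status: 'indexed' },
  { filename: 'b.pdf', status: 'indexed' },
]).catch(console.error);
```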

