I have large PDF files in AWS S3 which need to be indexed and made searchable. I want to upload these files to Elasticsearch, with each file indexed as a single document.
How can I do this from my Node.js application?
I have looked at Filebeat, Logstash, and FSCrawler; or can I use the Bulk API?
Any help would be really appreciated. I am new to Elasticsearch, so I'm having a tough time with this.
FSCrawler does not yet support reading data from S3. A feature request is open for this:
If you are using Node.js, you can probably read the binary from S3 and then send it to Elasticsearch. The "problem" is that this requires a lot of memory on the Elasticsearch side: it has to receive the JSON content, decode the binary from Base64, send it to Tika, create the final field, and index all of that. So for 25 MB of data, that will require a lot of memory IMO.
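A minimal sketch of that approach: wrap the PDF binary in a Base64-encoded field so an ingest pipeline with the attachment processor can extract it. The field name `data` matches the attachment processor's default; the S3 read and the index call are only sketched in comments (the index name and pipeline name are assumptions, not something from this thread):

```javascript
// Build the JSON document Elasticsearch expects for an ingest-attachment
// pipeline: the raw binary goes into a Base64-encoded field (default: "data").
function toAttachmentDoc(buffer, filename) {
  return {
    filename,                        // keep the original name so it is searchable
    data: buffer.toString('base64'), // the attachment processor decodes this field
  };
}

// In a real app you would read the object from S3 first, e.g. with
// @aws-sdk/client-s3's GetObjectCommand, then index the result, roughly:
//
//   const doc = toAttachmentDoc(pdfBuffer, 'report.pdf');
//   await esClient.index({ index: 'pdfs', pipeline: 'attachment', document: doc });
//
// Note: the whole Base64 string travels in one request, which is where the
// memory cost described above comes from.

const doc = toAttachmentDoc(Buffer.from('%PDF-1.4'), 'report.pdf');
console.log(doc.filename, doc.data.length);
```

This is a sketch under those assumptions, not a drop-in implementation; for really large files, the approaches below avoid pushing the Base64 payload through Elasticsearch at all.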
The ingest-attachment processor is perfect for small files. I'd not use it for big files.
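For reference, a minimal attachment pipeline looks something like this (the pipeline name `attachment` is just an example):

```json
PUT _ingest/pipeline/attachment
{
  "description": "Extract text from Base64-encoded binaries via Tika",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
```

Documents indexed with `?pipeline=attachment` then get the extracted text and metadata in an `attachment` field.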
FWIW, in that case I'd use dedicated ingest nodes which don't hold any data. That way, if a node is overloaded, it can die without any impact on the data...
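On recent versions, a dedicated ingest node is configured by giving it only the `ingest` role in `elasticsearch.yml`, roughly:

```yaml
# elasticsearch.yml on the dedicated ingest node:
# no data or master role, so an overload here cannot take data down with it
node.roles: [ ingest ]
```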
You can also run FSCrawler as a REST service. This might work, and at least it won't put the pressure on the Elasticsearch data nodes.
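Roughly, that looks like this (the job name `pdfs` is an example; host and port are FSCrawler's defaults). FSCrawler runs Tika itself, so the text extraction happens outside the cluster:

```shell
# Start FSCrawler with its REST endpoint enabled
bin/fscrawler pdfs --rest

# Upload a PDF; FSCrawler extracts the text locally and indexes the result,
# so no Base64 payload goes through the Elasticsearch data nodes
curl -F "file=@big-document.pdf" "http://127.0.0.1:8080/fscrawler/_upload"
```

Your Node.js app could do the same upload programmatically against that endpoint instead of talking to Elasticsearch directly.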
In the future, if the S3 input for FSCrawler becomes available, it could solve a use case like yours.
That made me think of another user story I have been asked about for a while. I just described it here:
I think it would make sense to implement something like that. I need to think about it.