I have large PDF files stored in AWS S3 that need to be indexed and searchable. I want to upload these files to Elasticsearch, with each file indexed as a single document.
How can I do this from my Node.js application?
I have looked at Filebeat, Logstash and FSCrawler, or should I use the Bulk API?
Any help would be really great. I am new to Elasticsearch, so I am having a tough time with this.
I didn't try it before, but it looks like you need to install the Ingest Attachment Processor plugin in Elasticsearch first.
Then use the attachment processor in your ingest pipeline to process the file. The PDF must be encoded into base64 before it is passed to Elasticsearch.
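For reference, here is a rough sketch of what that could look like with the official Node.js client (@elastic/elasticsearch), assuming the plugin has been installed with bin/elasticsearch-plugin install ingest-attachment. The pipeline and index names here are just examples:

```js
const { Client } = require('@elastic/elasticsearch');
const fs = require('fs');

const client = new Client({ node: 'http://localhost:9200' });

async function run() {
  // Ingest pipeline that runs the attachment processor on the
  // base64 content stored in the "data" field.
  await client.ingest.putPipeline({
    id: 'pdf-attachment',
    body: {
      description: 'Extract text from base64-encoded PDFs',
      processors: [
        { attachment: { field: 'data', indexed_chars: -1 } }
      ]
    }
  });

  // Read a PDF, base64-encode it and index it through the pipeline.
  const pdf = fs.readFileSync('./sample.pdf');
  await client.index({
    index: 'documents',
    pipeline: 'pdf-attachment',
    body: { filename: 'sample.pdf', data: pdf.toString('base64') }
  });
}

run().catch(console.error);
```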
@erictung I have already installed the Ingest Attachment Processor and used it to extract text from small PDF files that I uploaded to Elasticsearch.
I am not sure how to upload large PDF files (greater than 25 MB) to Elasticsearch. What is the best way to do it?
Is there a connector I can use to upload the encoded data as a stream?
Or should I download the S3 document into my Node.js application and then send it to Elasticsearch with a single PUT/POST?
FSCrawler does not yet support reading data from S3. A feature request is open for this.
If you are using Node.js, you can probably read the binary from S3 and then send it to Elasticsearch. The "problem" is that this will require a lot of memory on the Elasticsearch side, as it has to receive the JSON content, decode the binary from the base64, send it to Tika, create the final field, and index all of that. So for 25 MB of data, that will require a lot of memory IMO.
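Roughly something like this (a sketch only, using the AWS SDK and the official Node.js client; the bucket, key, index and pipeline names are placeholders, and it loads the whole object into memory because the attachment processor needs the complete base64 string anyway):

```js
const AWS = require('aws-sdk');
const { Client } = require('@elastic/elasticsearch');

const s3 = new AWS.S3();
const client = new Client({ node: 'http://localhost:9200' });

async function indexPdfFromS3(bucket, key) {
  // Download the whole object from S3 into a Buffer.
  const obj = await s3.getObject({ Bucket: bucket, Key: key }).promise();

  // Send it through the ingest pipeline created earlier.
  await client.index({
    index: 'documents',
    pipeline: 'pdf-attachment',
    body: { filename: key, data: obj.Body.toString('base64') }
  });
}

indexPdfFromS3('my-bucket', 'reports/big-report.pdf').catch(console.error);
```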
ingest-attachment is perfect for small files; I would not use it for big files.
FWIW, in that case I'd use dedicated ingest nodes which do not hold any data. That way, if a node is overloaded, it can die without any impact on the data...
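In elasticsearch.yml that is just a node with the ingest role and nothing else (the exact settings depend on your Elasticsearch version):

```yaml
# Elasticsearch 7.9+ role syntax: a dedicated ingest node, no data, no master.
node.roles: [ ingest ]

# On older versions the equivalent would be:
# node.master: false
# node.data: false
# node.ingest: true
```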
You can run FSCrawler as a REST service. This might work, and at least it won't put pressure on the Elasticsearch data nodes.
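As a sketch, assuming FSCrawler was started with the --rest option and is listening on its default address (127.0.0.1:8080), you could push each PDF to it from Node.js; the axios and form-data packages are just one way to do the multipart upload:

```js
const axios = require('axios');
const FormData = require('form-data');

async function uploadToFscrawler(pdfBuffer, filename) {
  const form = new FormData();
  form.append('file', pdfBuffer, { filename });

  // FSCrawler extracts the text with Tika and indexes the result into
  // Elasticsearch, so the data nodes never receive the raw base64 payload.
  await axios.post('http://127.0.0.1:8080/fscrawler/_upload', form, {
    headers: form.getHeaders(),
    maxBodyLength: Infinity // allow uploads larger than axios' default limit
  });
}
```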
In the future, if an S3 input for FSCrawler becomes available, that could solve a use case like yours.
That made me think of another user story I have been asked about for a while. I just described it here.
I think it would make sense to implement something like that. I need to think about it.