How to index a large file with Elasticsearch

Hi,

I have a large file saved in PDF format in Hadoop. I want to index the file contents and other file information. I was looking around the web and finally stumbled on this here.

But there they index the file by encoding it with Base64, saving the result to a JSON document, and creating the index from that JSON file.

Now, what if my file is, say, 1 GB in size? Wouldn't it take a long time to Base64-encode the file (without using Hadoop) and then create an index from it?

So my question is: what is the most efficient way to index a large file stored in Hadoop with Elasticsearch?

As you are probably coding in Java, you should read the file from Hadoop and send it to Apache Tika to extract the text you want to index.
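Something along these lines, as a minimal sketch of that first step. The HDFS URI, the file path, and the class name are placeholders, not code from your project:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tika.Tika;

import java.io.InputStream;
import java.net.URI;

public class HdfsTikaExtract {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI and file path
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path pdf = new Path("/data/report.pdf");

        Tika tika = new Tika();
        // Disable Tika's default string length limit for large documents
        tika.setMaxStringLength(-1);

        try (InputStream in = fs.open(pdf)) {
            // Streams the PDF from HDFS and returns only the plain text
            String text = tika.parseToString(in);
            System.out.println("Extracted " + text.length() + " characters");
        }
    }
}
```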

Then send this extracted text (and only that) to Elasticsearch.
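For example, with the Java high-level REST client it could look roughly like this. The index name `files`, the field names, the host, and the hard-coded path are assumptions for illustration; the extracted text is the output of the Tika step above:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.util.HashMap;
import java.util.Map;

public class IndexExtractedText {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            String extractedText = "..."; // plain text produced by Tika

            Map<String, Object> doc = new HashMap<>();
            doc.put("path", "hdfs://namenode:8020/data/report.pdf"); // file metadata
            doc.put("content", extractedText);                       // text only, no Base64

            // Index just the extracted text and metadata, not the raw PDF bytes
            IndexRequest request = new IndexRequest("files").source(doc);
            client.index(request, RequestOptions.DEFAULT);
        }
    }
}
```

That way the 1 GB PDF never has to be Base64-encoded or shipped to Elasticsearch; only the text you actually want to search gets indexed.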

My 2 cents.