Elasticsearch considerations for ingesting large files

adityaPsl · April 10, 2020, 9:22am

Hello Team,

I would like to understand the limits, scaling, and performance of Elasticsearch. What should we consider when ingesting large files (40-50 GB) along with their metadata and making them searchable?
For example, Microsoft Office documents, images, and ZIP files of 30-50 GB or more.
What is the best way to get a near-real-time search experience?

Thanks for your help and support

Thanks and Regards,
Aditya Deshpande

dadoonet · April 10, 2020, 9:37am

Elasticsearch is not designed to store big binary blobs.
If you can extract the text and index just the text though that's another story.

ZIP files are a different matter. I would not try to index the content of a whole ZIP; I'd rather index the individual files within it. So unzip, then index each file.
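
As a rough illustration of the "unzip, index each file" idea, here is a minimal Python sketch using only the standard library. It walks the archive entry by entry and yields one small document per file instead of one document for the whole ZIP. The function name `iter_zip_documents`, the field names, the `TEXT_EXTENSIONS` list and the `max_chars` limit are all assumptions made for this example; real Office documents or images would need a text extractor (for example the one FSCrawler uses) rather than a plain UTF-8 decode.

```python
import io
import zipfile

# Hypothetical list of extensions treated as plain text in this sketch.
TEXT_EXTENSIONS = (".txt", ".md", ".csv")


def iter_zip_documents(source, max_chars=100_000):
    """Yield one dict per file inside a ZIP archive.

    `source` may be a path or a binary file-like object.
    Only plain-text entries get a `content` field; other entries
    carry metadata only (path and uncompressed size).
    """
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            doc = {"path": info.filename, "size_bytes": info.file_size}
            if info.filename.lower().endswith(TEXT_EXTENSIONS):
                with archive.open(info) as member:
                    # Read a bounded prefix only; UTF-8 uses at most
                    # 4 bytes per character.
                    raw = member.read(max_chars * 4)
                text = raw.decode("utf-8", errors="replace")
                doc["content"] = text[:max_chars]
            yield doc
```

Each yielded dict could then be sent to Elasticsearch as its own document, so a search hit points at a file inside the archive rather than at the archive as a whole.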

By the way, you can have a look at the FSCrawler project, which can help you get started with binary documents.
If your documents are stored in Sharepoint, Dropbox, ... you should look at Workplace Search.

adityaPsl · April 10, 2020, 10:10am

Thank you very much dadoonet,

I will explore Workplace Search. Could you please point me to any technical documentation for it? Will it work if my documents are on a hard drive?
Meanwhile, could you please help me understand: if I extract the text and index it into Elasticsearch, say using FSCrawler with OCR, how can I achieve a near-real-time search experience? Parsing big files and extracting text is time-, memory-, and CPU-intensive. Is there any way to do it efficiently?

Thank you for your help and support.

Thank you,
Aditya

dadoonet · April 10, 2020, 10:24am

Not yet. But I'm planning to connect FSCrawler with it.

Well. Just start from Tutorial — FSCrawler 2.10-SNAPSHOT documentation

adityaPsl · April 11, 2020, 2:42am

Thank you very much dadoonet,
I think FSCrawler should be sufficient to crawl content in an S3 bucket.
A few questions about FSCrawler:

Is there any limit (per file) on sending crawled data from FSCrawler to Elasticsearch, since this transfer happens over HTTP/HTTPS?
Does FSCrawler support parallel processing? If multiple files are uploaded to the drive, will it crawl them all with multiple threads, or synchronously one after another?
Is it a good idea to store huge amounts of data in Elasticsearch? I assume a 40 GB file, parsed and indexed in Elasticsearch, would occupy approximately the same space.

Thank you,
Aditya

dadoonet · April 11, 2020, 9:31am

Yes. See Elasticsearch settings — FSCrawler 2.10-SNAPSHOT documentation

No. See

But you can run multiple instances of FSCrawler in parallel to monitor different sub dirs.

No. I don't think it is. You should only store the extracted text or a part of the extracted text. By default, FSCrawler extracts only the first 100000 characters. See Local FS settings — FSCrawler 2.10-SNAPSHOT documentation
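
To make the "store only a part of the extracted text" advice concrete, here is a small sketch that truncates each document's text and serializes the batch in the newline-delimited JSON layout that the Elasticsearch `_bulk` endpoint accepts (an action line followed by a source line). The function name `build_bulk_body`, the `content` and `content_truncated` fields, and the index name in the usage line are assumptions for illustration; this is not how FSCrawler itself is implemented.

```python
import json


def build_bulk_body(docs, index_name, max_chars):
    """Return an NDJSON string suitable as a `_bulk` request body.

    Each document's `content` is cut to `max_chars` characters, and a
    `content_truncated` flag records whether anything was dropped.
    """
    lines = []
    for doc in docs:
        source = dict(doc)  # copy so the caller's dict is not modified
        full_text = source.get("content", "")
        source["content"] = full_text[:max_chars]
        source["content_truncated"] = len(full_text) > max_chars
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(source))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Usage (index name "files" is hypothetical):
body = build_bulk_body([{"path": "a.txt", "content": "abcdefghij"}], "files", 4)
```

With a cap like this, the index size tracks the amount of extracted text you keep rather than the size of the original files.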

adityaPsl · April 11, 2020, 11:33am

Thank you very much dadoonet.
It really helps

Thank you,
Aditya

