Elasticsearch considerations for ingesting large files

Hello Team,

Just wanted to understand the limits, scaling, and performance of Elasticsearch: what should the considerations be while ingesting large files (40-50 GB) along with their metadata and making them searchable?
For example, Microsoft Office documents, images, and ZIP files larger than 30-50 GB.
What is the best way to get a near real-time search experience?

Thanks for your help and support

Thanks and Regards,
Aditya Deshpande

Elasticsearch is not designed to store big binary blobs.
If you can extract the text and index just the text, though, that's another story.

For ZIP files, I would not try to index the content of a whole ZIP as a single document; index the individual files within the ZIP instead. So unzip, then index each file.
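As a rough illustration of the unzip-and-index-each-file idea, here is a minimal Python sketch. It assumes the `elasticsearch` client (8.x style API) and the `tika` package for text extraction (FSCrawler itself uses Apache Tika internally); the node URL, index name, and file path are hypothetical, so adjust them to your setup.

```python
# Sketch: unzip an archive and index the extracted text of each member
# as its own Elasticsearch document. Node URL, index name, and path are
# placeholders -- adjust as needed.
import zipfile
from pathlib import Path

from elasticsearch import Elasticsearch
from tika import parser  # text extraction; FSCrawler uses Tika under the hood

es = Elasticsearch("http://localhost:9200")   # hypothetical endpoint
INDEX = "documents"                           # hypothetical index name

def index_zip(zip_path: str) -> None:
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            # Extract one member at a time instead of unpacking 40-50 GB at once.
            with archive.open(member) as f:
                parsed = parser.from_buffer(f.read())
            es.index(
                index=INDEX,
                document={
                    "archive": Path(zip_path).name,
                    "filename": member.filename,
                    "size_bytes": member.file_size,
                    # Index only the extracted text, never the raw binary.
                    "content": (parsed.get("content") or "").strip(),
                },
            )

index_zip("/data/archives/big-archive.zip")   # hypothetical path
```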

You can have a look at the FSCrawler project BTW, which can help you get started with binary documents.
If your documents are stored in SharePoint, Dropbox, ... you should look at Workplace Search.

Thank you very much dadoonet,

I will try to explore Workplace Search. Could you please point me to any technical documents available for this? Will it work if my documents are on a hard drive?
Meanwhile, could you please help me understand: if I want to extract the text and index it into Elasticsearch, let's say using FSCrawler and OCR, how can I achieve a near real-time search experience? Parsing big files and extracting text would be time-, memory-, and CPU-consuming tasks. Is there any way we can do it efficiently?

Thank you for your help and support.

Thank you,
Aditya

Not yet. But I'm planning to connect FSCrawler with it.

Well. Just start from Tutorial — FSCrawler 2.10-SNAPSHOT documentation


Thank you very much dadoonet,
I think FSCrawler should be sufficient to crawl contents on an S3 bucket.
A few queries about FSCrawler:

  1. Is there any limit (per file) on sending crawled data from FSCrawler to Elasticsearch, since this transfer happens over HTTP/S?
  2. Does FSCrawler support parallel processing? If multiple files are uploaded to the drive, will it crawl them all with multiple threads, or synchronously one after another?
  3. Is it a good idea to store huge data in Elasticsearch? I assume a 40 GB file parsed and indexed in Elasticsearch would occupy approximately the same space.

Thank you,
Aditya

Yes. See Elasticsearch settings — FSCrawler 2.10-SNAPSHOT documentation
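For context: Elasticsearch caps a single HTTP request body via `http.max_content_length` (100 MB by default), and FSCrawler sends documents in bulk requests, so the extracted text per document has to stay well under that. As a sketch of the analogous client-side control (using the Python client rather than FSCrawler's own settings), you can cap the batch size yourself:

```python
# Sketch: keep each bulk request under Elasticsearch's http.max_content_length
# (100 MB by default) by capping batch size on the client side.
# Assumes the elasticsearch-py client and the same hypothetical index as above.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def bulk_index(docs):
    actions = ({"_index": "documents", "_source": doc} for doc in docs)
    helpers.bulk(
        es,
        actions,
        chunk_size=100,                     # documents per bulk request
        max_chunk_bytes=50 * 1024 * 1024,   # stay comfortably under 100 MB
    )
```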

No. See

But you can run multiple instances of FSCrawler in parallel to monitor different sub dirs.

No. I don't think it is. You should only store the extracted text or a part of the extracted text. By default, FSCrawler extracts only 10000 characters. See Local FS settings — FSCrawler 2.10-SNAPSHOT documentation
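As a rough illustration of that approach (a sketch only, not FSCrawler's implementation): truncate the extracted text to a fixed budget, mirroring `indexed_chars`, and store a pointer back to the original file instead of the binary itself. The index name, path, and limit below are hypothetical.

```python
# Sketch: store only (part of) the extracted text plus a pointer back to the
# original file, so the 40 GB binary never goes into Elasticsearch.
INDEXED_CHARS = 10_000  # FSCrawler's default extraction budget

def to_document(path: str, extracted_text: str) -> dict:
    return {
        "path": path,                               # where the original file lives
        "content": extracted_text[:INDEXED_CHARS],  # searchable excerpt only
        "content_truncated": len(extracted_text) > INDEXED_CHARS,
    }

# es.index(index="documents", document=to_document("/mnt/share/report.docx", text))
```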

Thank you very much dadoonet.
It really helps 🙂

Thank you,
Aditya
