I'd like to understand the limits, scaling, and performance of Elasticsearch. What should I consider when ingesting large files (40-50 GB) together with their metadata and making them searchable?
For example: Microsoft Office documents, images, and ZIP files of 30-50 GB or more.
What is the best way to get a near-real-time search experience?
Elasticsearch is not designed to store big binary blobs.
If you can extract the text and index just the text, though, that's another story.
ZIP files are a different case. I would not try to index the content of a whole ZIP; I'd index the individual files within it instead. So: unzip, then index each file.
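To illustrate the "unzip, index each file" idea, here is a minimal Python sketch that walks a ZIP and produces one document per contained file. The names iter_zip_documents and extract_text, the "docs" index name, and the field names are all made up for illustration; extract_text is only a placeholder that decodes bytes as UTF-8, where a real pipeline would call a proper text extractor (FSCrawler relies on Apache Tika for that). Each member is read fully into memory, so very large entries would need streaming instead.

```python
import io
import zipfile
from typing import Iterator


def extract_text(name: str, data: bytes) -> str:
    """Hypothetical placeholder: a real pipeline would run a text
    extractor (for example Apache Tika) on the bytes instead."""
    return data.decode("utf-8", errors="replace")


def iter_zip_documents(archive, archive_name: str, index: str = "docs") -> Iterator[dict]:
    """Yield one bulk-style action per regular file inside the ZIP.

    `archive` can be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            with zf.open(info) as member:
                data = member.read()  # whole member in memory (assumption: it fits)
            yield {
                "_index": index,
                # Assumed ID scheme: archive name + path inside the archive
                "_id": f"{archive_name}!{info.filename}",
                "_source": {
                    "archive": archive_name,
                    "path": info.filename,
                    "size_bytes": info.file_size,
                    "content": extract_text(info.filename, data),
                },
            }
```

The dictionaries use the _index / _id / _source shape that the bulk helper of the official Python client (elasticsearch.helpers.bulk) accepts, so the generator could be passed to it directly; that wiring is left out here to keep the sketch dependency-free.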
You can have a look at the FSCrawler project, BTW, which can help you get started with binary documents.
If your documents are stored in SharePoint, Dropbox, ... you should look at Workplace Search.
I will explore Workplace Search. Could you please point me to any technical documentation available for it? Will it work if my documents are on a hard drive?
Meanwhile, could you please help me understand this: if I extract the text and index it into Elasticsearch, say using FSCrawler and OCR, how can I achieve a near-real-time search experience? Parsing big files and extracting text are time-, memory-, and CPU-intensive tasks. Is there any way to do this efficiently?
Thank you very much, dadoonet.
I think FSCrawler should be sufficient to crawl the contents of an S3 bucket.
A few questions about FSCrawler:
Is there any limit (per file) on the crawled data that FSCrawler sends to Elasticsearch, since this transfer happens over HTTP/HTTPS?
Does FSCrawler support parallel processing?
If multiple files are uploaded to the drive, will it crawl them all with multiple threads, or synchronously one after another?
Is it a good idea to store huge amounts of data in Elasticsearch? I assume a 40 GB file, once parsed and indexed in Elasticsearch, would occupy approximately the same space.