I have users who each have hundreds of GBs of documents (PDF, Word, etc.). I want to use Elastic with the ingest-attachment plugin to make the plaintext contents of these documents searchable. It’s unlikely these documents will ever be deleted, so the dataset will mostly keep growing.
My question: is this a use case that Elastic can safely handle? Or will I eventually run into a wall of sorts if the dataset gets too big? What I’m mostly worried about is that text searches will get slower as users upload more and more documents.
Hundreds of GBs is not really a "large" dataset in Elasticsearch terms; you can search TBs of data with a single node and there are clusters out there containing PBs of data that ingest hundreds of GBs of new documents every hour.
Moreover, PDFs and Word documents typically contain a lot of unsearchable overhead, so the size of the plaintext content that Elasticsearch actually sees is often many times smaller than the total file size.
I'm not saying that there are no scaling limits, of course, and this all depends on your usage pattern and performance goals too, but in terms of data size alone you're well within the comfort zone.
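For reference, here is a minimal sketch of the pipeline-plus-index flow you describe, using the elasticsearch 8.x Python client. The pipeline, index, field names, and file path are placeholders, not anything your setup requires:

```python
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

# Ingest pipeline that runs the attachment processor on a base64-encoded
# "data" field and extracts the plaintext into attachment.content.
es.ingest.put_pipeline(
    id="docs-attachment",
    processors=[{"attachment": {"field": "data", "remove_binary": True}}],
)

# Index one file through the pipeline; only the extracted text gets analyzed.
with open("report.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

es.index(index="user-docs", pipeline="docs-attachment", document={"data": encoded})

# Full-text search over the extracted content.
hits = es.search(
    index="user-docs",
    query={"match": {"attachment.content": "quarterly revenue"}},
)
```

Note that only the extracted text in `attachment.content` is indexed for search; with `remove_binary` enabled the original base64 payload isn't stored in the document at all, which keeps the index much smaller than the raw files.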