My task is a full-text search system for a really large number of documents (tens of millions). I have the documents as RTF files along with their metadata, and all of this will be indexed in Elasticsearch. The documents are immutable (they can only be deleted). I don't expect many new documents per day, and I control when documents are inserted. So is it a good idea to use Elasticsearch as the primary database in this case?
Maybe I'll store the RTF files separately, but I don't really see the point of storing all this data somewhere else.
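For context, the ingest path you describe usually means extracting plain text from each RTF and sending it to Elasticsearch's `_bulk` API together with the metadata. Here's a minimal sketch of building such a bulk request body; the index name and field names (`documents`, `title`, `body`) are placeholders, not anything Elasticsearch prescribes:

```python
import json

def build_bulk_payload(docs, index="documents"):
    """Build an NDJSON body for Elasticsearch's _bulk API.

    Each doc is a dict with an "id" plus metadata fields and the
    plain text extracted from the RTF file. Index and field names
    here are illustrative assumptions.
    """
    lines = []
    for doc in docs:
        # Action line: index each document under its own stable id,
        # so a re-run overwrites rather than duplicates.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        # Source line: the metadata plus the extracted text.
        lines.append(json.dumps({k: v for k, v in doc.items() if k != "id"}))
    # The _bulk endpoint requires a trailing newline.
    return "\n".join(lines) + "\n"

docs = [
    {"id": "doc-1", "title": "First", "body": "extracted plain text ..."},
    {"id": "doc-2", "title": "Second", "body": "more text ..."},
]
payload = build_bulk_payload(docs)
```

You'd POST `payload` to `http://<host>:9200/_bulk` with the `application/x-ndjson` content type; since you control when documents are inserted, running this in controlled batches keeps indexing pressure predictable.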
Elasticsearch should be fine for this use case. Just be sure to keep snapshots. Keeping the original documents is also a good idea, so you're able to completely rebuild the indices if the worst happens, or if you hit a bug during ingest and need to start from scratch.
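Setting up snapshots amounts to registering a repository and then taking snapshots against it. A minimal sketch using a shared-filesystem repository follows; the repository name, snapshot name, and path are placeholders, and the path must be whitelisted under `path.repo` in `elasticsearch.yml`:

```shell
# Register a filesystem snapshot repository (placeholder names/paths).
curl -X PUT "localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/backups/es"}}'

# Take a snapshot of all indices and wait for it to finish.
curl -X PUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
```

Since your corpus rarely changes, snapshots after each ingest batch are cheap: Elasticsearch snapshots are incremental, so unchanged segments are not copied again.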
Depending on how you define "really large amount", Elasticsearch could be overkill. If you don't need to distribute the data over multiple nodes and you have some Java expertise, using vanilla Lucene is worth considering.