Like many here, we are exploring using ES as a NoSQL solution for
large-scale content storage: millions of documents of varying and
changeable types, running on clusters of 2-50 servers, with large
batches of content imported at different times plus ongoing content
updates. From everything I've been reading, it is possible to use ES
as a NoSQL solution, but there are some potential gotchas.
First, backup. My understanding is that the current backup solution
is "distributed", in that we should use local gateways and back up the
work directory on each node (from what I've read, a networked gateway
can saturate a network link, which could be problematic in a large
cluster). When restoring a cluster, we would need to restore enough
nodes to meet the shard/replica settings before the index would
unblock. Got it.
Question #1: is it possible to have a more centralized backup
solution? Something that would periodically run a scan search and copy
the results to a remote (backup) node? On failure, we could point to
that node and rejoin other nodes to it to grow the cluster back out.
Or maybe have a separate index called "backup" that lives on only a
single node, with a percolator that copies any activity from the
distributed indexes to the backup index. This would supplement the
local gateway backups, so we could be confident we got everything?
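To make the scan-search idea concrete, here is a minimal sketch of the copy loop I have in mind. It is simulated against plain in-memory dicts rather than a live cluster, and the names (`scan_batches`, `backup_index`) are hypothetical placeholders, not ES APIs:

```python
# Hypothetical sketch of the periodic scan-search backup idea.
# scan_batches stands in for a scan/scroll search; the "cluster"
# and "backup node" are plain dicts so the logic runs anywhere.

def scan_batches(docs, batch_size):
    """Yield documents in fixed-size batches, like a scan/scroll search."""
    items = list(docs.items())
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def backup_index(cluster_docs, backup_node, batch_size=2):
    """Copy every document from the cluster to the backup node."""
    copied = 0
    for batch in scan_batches(cluster_docs, batch_size):
        for doc_id, source in batch:
            backup_node[doc_id] = source  # bulk-index into the backup index
            copied += 1
    return copied

cluster = {"1": {"title": "a"}, "2": {"title": "b"}, "3": {"title": "c"}}
backup = {}
print(backup_index(cluster, backup))  # → 3
print(backup == cluster)              # → True
```

The point being: as long as the scan can keep up with the write rate, the backup node ends up with a complete copy without touching any node's work directory.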
Second, and very related, is persistence. For peace of mind, we were
going to persist the content objects either in a database as blobs or
on the filesystem. That way, on catastrophic failure, we could rebuild
the index/content store by batch reindexing these blobs/files into a
new index. Would it be better to write a "river" that adds the
content to the index when it is added to the persistence store, or a
"percolator" that fires when content is added to the index and then
persists the changed content? Is the percolator a valid option here
as a backup and/or persistence strategy?
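For what it's worth, here is the persistence-first ordering we are weighing, sketched with in-memory stand-ins (the `blob_store`/`save_content` names are ours for illustration, not ES or river APIs): content is written to the durable store first, then indexed, and a rebuild is just a batch reindex of every stored blob:

```python
import json

# Hypothetical persistence-first flow: blob_store stands in for a
# database/filesystem of record, index for the searchable ES index.

blob_store = {}   # durable store of record (db table or files on disk)
index = {}        # the searchable index (rebuildable at any time)

def save_content(doc_id, content):
    """Persist first, then index -- the 'river'-style ordering."""
    blob_store[doc_id] = json.dumps(content)  # durable write
    index[doc_id] = content                   # index for search

def rebuild_index():
    """On catastrophic failure, batch reindex every persisted blob."""
    index.clear()
    for doc_id, blob in blob_store.items():
        index[doc_id] = json.loads(blob)
    return len(index)

save_content("42", {"title": "hello"})
index.clear()                  # simulate losing the index entirely
print(rebuild_index())         # → 1
print(index["42"]["title"])    # → hello
```

The percolator ordering would invert the first two lines of `save_content` (index first, persist on the match callback), which is exactly what makes me nervous: a crash between the two steps loses the durable copy rather than the rebuildable one.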