ES as noSQL (with peace of mind)

Hey everyone:

Like many here, we are exploring using ES as a NoSQL solution for a
large content store - millions of documents of varying and changeable
types running on 2-50 server clusters, with large batches of content
imported at different times and ongoing content updates. From
everything I've been reading, it is possible to use ES as a NoSQL
solution, but there are some potential gotchas.

First, backup. My understanding is that the current backup solution
is "distributed" in that we should use local gateways and back up the
work directory on each node (from what I've read, using a networked
gateway can saturate a network link, and in a large cluster that
could be problematic). When restoring a cluster, we would need to
restore enough nodes to meet the shard/replica settings before the
index would unblock. Got it.
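
In practice, the per-node piece of this is probably just archiving each
node's data/work directory on a schedule. A rough, untested sketch (the
paths and node name are assumptions about our own layout, not anything
ES-specific):

```python
# Archive a node's local gateway / work directory so it can be restored
# later. DATA_DIR and BACKUP_DIR are assumptions about our own layout.
import tarfile
import time
from pathlib import Path

DATA_DIR = Path("/var/lib/elasticsearch")        # assumed node data dir
BACKUP_DIR = Path("/mnt/backups/elasticsearch")  # assumed backup target

def backup_node(node_name: str) -> Path:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = BACKUP_DIR / f"{node_name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Pack the whole data directory for this node into one archive.
        tar.add(DATA_DIR, arcname=node_name)
    return archive

if __name__ == "__main__":
    print(backup_node("node-1"))
```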

Question #1: is it possible to have a more centralized backup
solution? Something that would use a scan search to a remote (backup)
node on a periodic basis? On failure, we could point to that node and
rejoin the other nodes to it to grow the cluster back out. Or maybe
have a separate index called backup that lives only on a single node,
with a percolator that would copy any activity from the distributed
indexes to the backup index. That would let us back up that one
node's local gateway and be confident we got everything?
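
To make the scan idea concrete, here is roughly what I am imagining -
untested, with the hosts, index names and page size as placeholders,
using the scan/scroll and bulk APIs:

```python
# Periodic "pull" backup sketch: scan/scroll every document out of the
# live cluster and bulk-index it into a dedicated backup node.
# Hosts and index names are placeholders; error handling is omitted.
import json
import requests

LIVE = "http://live-node:9200"       # any node in the live cluster (assumed)
BACKUP = "http://backup-node:9200"   # the single backup node (assumed)
INDEX = "content"
BACKUP_INDEX = "backup"

def copy_index():
    # Open a scan-type scroll over the whole index (0.x-era API).
    r = requests.get(
        f"{LIVE}/{INDEX}/_search",
        params={"search_type": "scan", "scroll": "10m", "size": 100},
        data=json.dumps({"query": {"match_all": {}}}),
    )
    scroll_id = r.json()["_scroll_id"]

    while True:
        page = requests.get(
            f"{LIVE}/_search/scroll",
            params={"scroll": "10m", "scroll_id": scroll_id},
        ).json()
        hits = page["hits"]["hits"]
        if not hits:
            break
        scroll_id = page["_scroll_id"]

        # Re-index this page into the backup node via the bulk API.
        lines = []
        for hit in hits:
            lines.append(json.dumps({"index": {
                "_index": BACKUP_INDEX,
                "_type": hit["_type"],
                "_id": hit["_id"],
            }}))
            lines.append(json.dumps(hit["_source"]))
        requests.post(f"{BACKUP}/_bulk", data="\n".join(lines) + "\n")

if __name__ == "__main__":
    copy_index()
```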

Second, and very related, is persistence. For peace of mind, we were
going to persist the content objects either as blobs in a DB or as
files on the filesystem. That way, on catastrophic failure, we could
rebuild the index/content store by batch reindexing those blobs/files
into a new index. Would it be better to write a "river" that adds the
content to the index when it is added to the persistence store, or a
"percolator" that fires when content is added to the index and then
persists the changed content? Is the percolator a valid option here
for a backup and/or persistence strategy?
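
The rebuild path itself seems straightforward - something along these
lines, assuming one JSON file per content object (paths, index and type
names are placeholders, and this is untested):

```python
# Rebuild a fresh index from persisted JSON files after a catastrophic
# failure. Paths and index/type names are placeholders.
import json
import requests
from pathlib import Path

ES = "http://localhost:9200"               # assumed ES endpoint
NEW_INDEX = "content-rebuilt"
CONTENT_DIR = Path("/data/content-blobs")  # assumed file store
BATCH_DOCS = 500

def rebuild():
    lines = []
    for path in sorted(CONTENT_DIR.glob("*.json")):
        doc = json.loads(path.read_text())
        # Use the filename (without extension) as the document id.
        lines.append(json.dumps({"index": {
            "_index": NEW_INDEX, "_type": "content", "_id": path.stem}}))
        lines.append(json.dumps(doc))
        if len(lines) >= BATCH_DOCS * 2:
            requests.post(f"{ES}/_bulk", data="\n".join(lines) + "\n")
            lines = []
    if lines:
        requests.post(f"{ES}/_bulk", data="\n".join(lines) + "\n")

if __name__ == "__main__":
    rebuild()
```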

On Wed, Jul 20, 2011 at 6:45 PM, Will Ezell will@dotcms.com wrote:

Hey everyone:

Like many here, we are exploring using ES as a NoSQL solution for a
large content store - millions of documents of varying and changeable
types running on 2-50 server clusters, with large batches of content
imported at different times and ongoing content updates. From
everything I've been reading, it is possible to use ES as a NoSQL
solution, but there are some potential gotchas.

First, backup. My understanding is that the current backup solution
is "distributed" in that we should use local gateways and back up the
work directory on each node (from what I've read, using a networked
gateway can saturate a network link, and in a large cluster that
could be problematic). When restoring a cluster, we would need to
restore enough nodes to meet the shard/replica settings before the
index would unblock. Got it.

Question #1: is it possible to have a more centralized backup
solution? Something that would use a scan search to a remote (backup)
node on a periodic basis? On failure, we could point to that node and
rejoin the other nodes to it to grow the cluster back out. Or maybe
have a separate index called backup that lives only on a single node,
with a percolator that would copy any activity from the distributed
indexes to the backup index. That would let us back up that one
node's local gateway and be confident we got everything?

Actually, backup is something that I have been thinking about for some
time when it comes to the local gateway. What I was thinking about is
combining the "snapshotting" feature of the shared gateway with the
local gateway. What I mean is, being able to snapshot an index at a
point in time into shared storage, and possibly recover it from there
later on. That does require work though... :)

Second, and very related, is persistence. For peace of mind, we were
going to persist the content objects either as blobs in a DB or as
files on the filesystem. That way, on catastrophic failure, we could
rebuild the index/content store by batch reindexing those blobs/files
into a new index. Would it be better to write a "river" that adds the
content to the index when it is added to the persistence store, or a
"percolator" that fires when content is added to the index and then
persists the changed content? Is the percolator a valid option here
for a backup and/or persistence strategy?

I don't think the percolator makes sense here, or a river for that
matter. You can do it at the app level, that should be simplest?
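
At the app level that could be as simple as persisting first and then
indexing in the same save path - a minimal, untested sketch with
placeholder names and paths:

```python
# App-level dual write: on every save, persist the raw content first,
# then index it into ES. Store path, index and type names are placeholders.
import json
import requests
from pathlib import Path

ES = "http://localhost:9200"          # assumed ES endpoint
STORE = Path("/data/content-blobs")   # assumed durable file store

def save_content(doc_id: str, doc: dict) -> None:
    # 1. Persist the source of truth (a file here; could be a DB blob).
    STORE.mkdir(parents=True, exist_ok=True)
    (STORE / f"{doc_id}.json").write_text(json.dumps(doc))
    # 2. Then index it into ES for search and queries.
    requests.put(f"{ES}/content/content/{doc_id}", data=json.dumps(doc))
```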

First, backup. (...)
Second, and very related, is persistence.

For what it's worth: we're using ES in a "database" mode as well, not
really touching any other storage for querying and displaying
data/aggregations.

Because of questions like yours, we're storing our data in CouchDB as
well. It has a stellar reputation when it comes to durability, it is a
schema-free, JSON-based database, and it is very easy to pull data
from CouchDB thanks to its _changes interface.
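
The pull from _changes is essentially just this (simplified and
untested; database and index names are placeholders, and
deletes/conflicts are ignored):

```python
# Pull documents from CouchDB's _changes feed and index them into ES.
# Database URL and ES index/type names are placeholders.
import json
import requests

COUCH = "http://localhost:5984/content"    # assumed CouchDB database
ES = "http://localhost:9200/content/doc"   # assumed ES index/type

def sync(since=0):
    changes = requests.get(
        f"{COUCH}/_changes",
        params={"include_docs": "true", "since": since},
    ).json()
    for change in changes["results"]:
        doc = change.get("doc")
        if doc is None:
            continue
        # Strip CouchDB's internal fields (_id, _rev, ...) before indexing.
        body = {k: v for k, v in doc.items() if not k.startswith("_")}
        requests.put(f"{ES}/{change['id']}", data=json.dumps(body))
    # Remember last_seq so the next sync can resume from here.
    return changes["last_seq"]

if __name__ == "__main__":
    sync()
```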

Once a 100% durable, append-only, crash-resistant, replicated backup
solution for ES appears, we'll consider moving away from this setup...
But truth be told, we're quite happy with CouchDB as the backup
storage.

Karel