ElasticSearch as a document database

Hi,

I am modelling ElasticSearch both as a search engine AND a document
database.
I am finding enough features in ElasticSearch to do this, like:

  1. Unique document IDs.
  2. The ability to store PDFs (or documents in any other format) and to
     index them.
  3. Versioning of documents.

And lots of others (see the sketch below for 1-3).
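
Concretely, I am planning something like the following (a rough sketch with
the Java client; all names are placeholders, and the "file" field assumes
the mapper-attachments plugin):

    import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class DocStore {
        public static void main(String[] args) throws Exception {
            Client client = new TransportClient().addTransportAddress(
                    new InetSocketTransportAddress("localhost", 9300));

            // 1. Unique document ID: we pick the ID ("contract-42") ourselves.
            // 2. Any format: a base64-encoded PDF in a field mapped as type
            //    "attachment" gets its text extracted and indexed.
            client.prepareIndex("docs", "contract", "contract-42")
                    .setSource(jsonBuilder().startObject()
                            .field("title", "Supply agreement")
                            .field("file", pdfAsBase64()) // hypothetical helper
                            .endObject())
                    .execute().actionGet();

            // 3. Versioning: send the version we last saw, and ES rejects the
            //    write with a version conflict if the doc changed meanwhile.
            client.prepareIndex("docs", "contract", "contract-42")
                    .setSource("{\"title\": \"Supply agreement (amended)\"}")
                    .setVersion(1)
                    .execute().actionGet();

            client.close();
        }

        static String pdfAsBase64() {
            return ""; // placeholder: read the PDF bytes and base64-encode them
        }
    }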

My basic question: is it fine if I go ahead with this?
Are there any issues that I might encounter while going ahead with this
model?
Is using ES as a document database part of its basic design?

Let me know the pros and cons of this approach ...

Thanks
Vineeth

On Sat, Aug 13, 2011 at 4:15 PM, James Cook jcook@tracermedia.com wrote:

http://elasticsearch-users.115913.n3.nabble.com/ElasticSearch-as-a-database-td2948572.html
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/dcb1237b149bf5af/1802d01b9b3ccfd3?#1802d01b9b3ccfd3

Thanks a ton James.

The links really helped.
I would have been happier if ES had the "continuous changes"/polling
feature CouchDB has.

Thanks
Vineeth

On Sat, Aug 13, 2011 at 11:01 PM, James Cook jcook@tracermedia.com wrote:

Personally, I like that ElasticSearch hasn't become too much of a
kitchen-sink architecture. Obviously, there is a grey area there, because I
use it as a database although it is designed for search, so I wish realtime
search and snapshotting were there. It will be interesting to see how it
evolves.

Shay has intentionally not added things like a notification system and HTTPS
support. There are ways to support a lot of these features outside of the ES
core.

We're looking at ES from (I suspect) exactly the same perspective as
Vineeth. Our objective is to reduce the number of moving parts needed.
We have Googled and found a number of commentaries on this approach,
including the ones James linked to.

The one thing I haven't found a good answer to is the data integrity issue.
I haven't heard of recent Lucene index corruptions (assuming one isn't using
Java 7), but what are the real risks here?
We've been using couchdb-lucene for nearly two years without a hitch, but
with far too few documents (< 100k) and updates (half a dozen per doc, on
average) to have a good handle on this.
Is a Lucene index recoverable in any way, shape, or form if it does go
south? Using Luke?
Would a snapshotting filesystem be an answer?
If an index goes south, are the problems replicated (is this bit-level
replication), or will something in the replication process recognize
corruption at the index protocol level and prevent it from spreading?
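
The only salvage path I've seen mentioned is Lucene's CheckIndex tool, which
can detect corrupt segments and drop them, losing their documents. If I
understand it, the shape is roughly this (Lucene 3.x API; the shard path is
made up):

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.FSDirectory;

    public class FixIndex {
        public static void main(String[] args) throws Exception {
            // Path to one shard's Lucene index inside the ES data dir (hypothetical).
            File shard = new File("/var/data/elasticsearch/nodes/0/indices/docs/0/index");

            CheckIndex checker = new CheckIndex(FSDirectory.open(shard));
            CheckIndex.Status status = checker.checkIndex();

            if (!status.clean) {
                // Rewrites the segments file without the corrupt segments;
                // their documents are gone for good.
                checker.fixIndex(status);
            }
        }
    }

But that is salvage rather than recovery, hence the questions above.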

I don't see how an external notification system can work unless all activity
is under the control of a message broker, which adds a middleware layer we'd
like to avoid.
Shay has just written up his thoughts on notifications in another thread.
Mostly it's about the implementation issues rather than an argument that
they have no value. Nothing there surprises me, although I suspect he
slightly understated, if anything, the issues around maintaining WAL
integrity etc., especially if one is depending on write quorums - a prime
attraction for us.

Increasingly, we're finding that CouchDB has become a queuing system for us,
actually, and that the external Lucene indices are the primary source of
truth. If we can store the original document in the indices using ES, we may
not need CouchDB, and the moving-parts count becomes smaller. I'm a little
concerned about the absence of a good reliability story in the Java
ecosystem (i.e. Erlang's supervision trees), but at the highest level, I
suppose one uses Monit or something like that. Maybe even Upstart or launchd
does it. And I see clues that Shay has included something like this inside
ES.

We use a master external evented listener that dispatches to a significant
number of command-line invocations to perform all manner of actions,
including temporal-based schema validations. It's attached to both CouchDB
and PostgreSQL. This allows us to write individual tasks in a variety of
languages, from bash to Python to Haskell - the selection usually depending
on the need for external libraries. Call it sideware. This works extremely
well and allows exceptionally loose coupling.
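
In outline, ours amounts to tailing CouchDB's continuous _changes feed and
shelling out, roughly like this (a sketch; the URL and script path are made
up):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class ChangesDispatcher {
        public static void main(String[] args) throws Exception {
            // CouchDB's continuous changes feed emits one JSON line per change.
            URL feed = new URL("http://localhost:5984/docs/_changes?feed=continuous");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(feed.openStream(), "UTF-8"));

            String change;
            while ((change = in.readLine()) != null) {
                if (change.trim().isEmpty()) continue; // heartbeats are blank lines

                // Hand the raw change line to an external task (script path made up).
                Process task = new ProcessBuilder("/usr/local/bin/validate-doc", change)
                        .redirectErrorStream(true)
                        .start();
                // Drain the task's output so it can't block on a full pipe.
                BufferedReader out = new BufferedReader(
                        new InputStreamReader(task.getInputStream()));
                while (out.readLine() != null) { /* discard */ }
                task.waitFor();
            }
        }
    }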

One CouchDB difficulty is the lack of chained map-reduce operations. I
wonder if ES can easily support the sort of recursive queries that would
ease the pain? I don't know yet.

As ES (apparently) has no concept of access control, we'll be forced to use
some sort of access proxy - it could easily handle HTTPS too. We really want
to keep this lightweight.

We already live in a distributed world - our 'client' code is primarily web
apps running online and off. Most application logic is on the client.
These are really another species of distributed application, and we've
learned to live happily without the crutch of referential integrity.
Given that a single HTTP request is the basic unit of atomicity in this
world, and that most business workflows are still modelled on paper
documents (hmmm, looks like a web page), an RDBMS creates a considerable
impedance mismatch. A document store works just fine here, and so far ES
seems to make the querying a manageable beast.

I imagine, though, that for someone who thinks in terms of an RDBMS, the
shift can be just as difficult as, say, the one from imperative to
functional programming. Aside from index integrity, I suspect this is the
real issue.

d.r.

My feedback will be strictly anecdotal. In fact, the majority of the
problems you see people experience on the list are either user error or a
small bug which is quickly patched. There isn't a roadmap to say when 1.0
will be done or what the final feature set will be. This hasn't stopped many
organizations from using ES in production. We started using it in
development at version 0.5 in early 2010, and went into production with two
different applications: one on 0.11 and the other on 0.16. We haven't run
into any trouble with resources, data loss/corruption, or performance.

In fact, during the summer of 2010 we were still running both MongoDB and
ES, until it became clear that it didn't make sense to store the same data
in two different data stores. ES had a much better story at that time
regarding sharding, replicas, and disk space issues. It was a no-brainer to
pull the plug on MongoDB. (I'm sure things are better for MongoDB by now.)

At the moment, I wouldn't run ES as my only data store without some way to
re-index the data. For us, we run on Amazon EC2 and use the S3 gateway for
long-term persistence. This is because our EC2 instances do not use EBS
(persistent hard disks), so when an instance goes away, so does the
local copy of the data. Of course, if we lose a primary shard and the nodes
carrying its replicas, the data is gone. This is why we use the S3 gateway:
ES keeps all of the data and cluster configuration there. If we lose every
node in the cluster, we can start the cluster back up and all the nodes will
initialize from the S3 gateway.
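
The gateway setup is only a few lines of elasticsearch.yml, something like
the following (bucket and keys are placeholders; exact setting names may
differ slightly across 0.x versions):

    gateway.type: s3
    gateway.s3.bucket: my-es-gateway-bucket

    cloud.aws.access_key: <access key>
    cloud.aws.secret_key: <secret key>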

Shay has done a good job of maintaining compatibility between release
versions, and we haven't run into too many issues. I think I remember one
release where the data in the gateway was no longer compatible with the
upgraded version.

So, this brings us to the dirty word in Lucene-based storage: re-indexing.
Whether or not you ever experience data corruption, you will most likely run
into a situation where you have to re-index all of your data. It may be
caused by an incompatible release, a change in your mapping files, or a data
corruption problem.
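
At its core, a re-index is a scan over the old index's stored _source and a
bulk write into the new one. A rough sketch with the 0.x-era Java client
(index names made up; package and accessor names approximate across 0.x
versions):

    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchType;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.search.SearchHit;

    public class Reindex {
        public static void reindex(Client client, String from, String to) {
            // Open a scan over every document in the source index; the scan
            // search type streams hits in index order without scoring.
            SearchResponse scan = client.prepareSearch(from)
                    .setSearchType(SearchType.SCAN)
                    .setScroll(new TimeValue(60000))
                    .setSize(100) // hits per shard, per scroll round-trip
                    .execute().actionGet();

            while (true) {
                scan = client.prepareSearchScroll(scan.scrollId())
                        .setScroll(new TimeValue(60000))
                        .execute().actionGet();
                if (scan.hits().hits().length == 0) {
                    break; // drained
                }
                // Bulk-write the stored _source of each hit into the target index.
                BulkRequestBuilder bulk = client.prepareBulk();
                for (SearchHit hit : scan.hits()) {
                    bulk.add(client.prepareIndex(to, hit.type(), hit.id())
                            .setSource(hit.sourceAsString()));
                }
                bulk.execute().actionGet();
            }
        }
    }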

The other threads discuss the high-level abstractions that exist in some
databases but not in ES (near-real-time search, transactions, WAN
replication, snapshotting, etc.).

Right now, we write a copy of our ES data to S3 every 30 minutes, in a
simple <index>/<type>/<id> = bytes layout. I would love to have a
feature where we could do this against a snapshot of the data instead of the
moving target we currently use. I don't have any experience with snapshotted
filesystems, but I am reading up on that now after you mentioned it.
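
The per-document write is tiny - roughly this with the AWS SDK for Java
(bucket and credentials are placeholders), fed by the same kind of scroll as
the re-index sketch above:

    import java.io.ByteArrayInputStream;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import org.elasticsearch.search.SearchHit;

    public class BackupWriter {
        private static final String BUCKET = "my-es-backup"; // placeholder

        private final AmazonS3 s3 =
                new AmazonS3Client(new BasicAWSCredentials("<access>", "<secret>"));

        // Called for every hit while scrolling over an index, exactly like
        // the re-index loop above.
        public void backup(SearchHit hit) {
            byte[] source = hit.source();
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(source.length);
            s3.putObject(BUCKET,
                    hit.index() + "/" + hit.type() + "/" + hit.id(),
                    new ByteArrayInputStream(source), meta);
        }
    }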

-- jim

> I don't have any experience with snapshotted filesystems, but I am
> reading up on that now after you mentioned it.

I should have mentioned it, but I'm thinking primarily of ZFS. There's btrfs
too. You could use the word modern instead of snapshotted :wink:

> So, this brings us to the dirty word in Lucene-based storage: re-indexing.
> Whether or not you ever experience data corruption, you will most likely
> run into a situation where you have to re-index all of your data. It may
> be caused by an incompatible release, a change in your mapping files, or
> a data corruption problem.

Understood. Yes, changing mapping files is an issue I was wondering about.
One of the attractions of Couch (relative to an RDBMS) is the ability to
restore (via filtered replication) a subset of the full data set, in order
to bring back partial functionality much more quickly: rebuild views on
recent and critical reference data first, to keep users from reaching for
pitchforks and torches. It seems to me that a careful sharding approach
provides much the same opportunity in ES with regard to re-indexing and
disaster recovery. And I presume that the answer to Erlang's hot code
reloading is to take nodes offline individually for upgrading or other
maintenance, leaving the cluster fully functional.

d.r.

To expand on what James wrote: the current focus is on features that users
can't implement "outside" of elasticsearch. That's not to say that things
like security won't eventually be in elasticsearch; it's just that the main
focus is not there right now.

To add another data point to the discussion: we've been using ElasticSearch
as a data store in multiple customer implementations, and we were using its
predecessor, Compass, before that. ES performance is good and it scales
well. Like many others, I think ES is often sufficient as a data store by
itself, even if it does not have all the features provided by the likes of
MongoDB, CouchDB, etc.

We love the fact that Shay has focused on features that cannot be easily or
efficiently implemented outside ES. The rate of progress has been amazing,
and the stability of the solution is increasing with every release. Like
many others, we would also like to be able to use and rely on ES as the
master data store. There are a couple of challenges in this area:

  • The API and the index format are not yet declared stable. Upgrading to a
    new version of the library (for new functionality or bug fixes) may require
    reindexing all of the data. When you deal with a lot of data, say 1TB+,
    reindexing is not a viable option. As a result, there is a concern that we
    may get stuck with an older/buggy version.
  • The only cases of data corruption we've seen happened due to resource
    problems: OOM errors, running out of disk space, etc. Monitoring of the
    environment is therefore essential to prevent this type of problem. ES does
    provide a lot of performance and health metrics (see the sketch after this
    list).
  • Reliable backups are not that easy. One way to alleviate the data
    corruption concerns is to back up the data and make sure it can be restored
    somewhere else. When the worst-case scenario is restoring the data from the
    last daily backup, you can sleep easier.
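
To make the monitoring point concrete: those metrics are a couple of client
calls away. A sketch with the 0.x-era Java API (accessor names approximate):

    import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
    import org.elasticsearch.action.admin.cluster.node.stats.NodesStatsResponse;
    import org.elasticsearch.client.Client;

    public class HealthCheck {
        public static void check(Client client) {
            // Overall cluster state: green / yellow / red.
            ClusterHealthResponse health = client.admin().cluster()
                    .prepareHealth().execute().actionGet();
            System.out.println("cluster status: " + health.status());

            // Per-node statistics include JVM heap and filesystem usage.
            NodesStatsResponse stats = client.admin().cluster()
                    .prepareNodesStats().execute().actionGet();
            System.out.println("nodes reporting: " + stats.nodes().length);
        }
    }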

To be clear, these problems are not blockers for all the projects we want to
use ES for. There are workarounds for most of them. But I think once these
challenges are addressed in ES, it will be a lot easier for people to use it
as their "single point of truth", as it was put. In short, from my
perspective, a version 1.0 signalling a stable API and index structure, plus
some support and/or guidelines on how to back up and restore indices and how
to monitor available memory, disk, etc., is all that is left for ES to be
usable as a database by itself by anyone.

My 1.23 cents ...

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

> The only cases of data corruption we've seen happened due to resource
> problems: OOM errors, running out of disk space, etc.

That this is at all possible is extremely worrisome, especially given most
JVM deployments' appetite for RAM.

On Mon, Aug 15, 2011 at 4:52 PM, Shay Banon kimchy@gmail.com wrote:

We should be past those now..., i.e. you won't lose data because of an OOM
happening.

Right! To be clear, as I've mentioned, we've been using ES since the
beginning, and we've not seen these recently.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

At Mozilla, we have been building a general-purpose API or glue layer to
fill in these features that can easily exist outside ES. The project name is
Bagheera. It can still potentially leave a big integration project with
three or four "moving pieces": load balancer, middleware, primary datastore,
and secondary datastore. So far, this has been a good balance point for us.
We have deployed Bagheera with ES in two modes: embedded (where the same JVM
process runs both ES and Bagheera) and external (where Bagheera uses the
Java client RPC to communicate with ES).
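
Roughly, the two modes differ only in how the ES Client is obtained. A
sketch with the 0.x-era Java API (cluster name and host are made up):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;
    import org.elasticsearch.node.Node;
    import org.elasticsearch.node.NodeBuilder;

    public class ClientModes {
        // Embedded: this JVM joins the cluster as a node and talks to ES
        // in-process.
        static Client embedded() {
            Node node = NodeBuilder.nodeBuilder()
                    .clusterName("metrics-cluster") // made-up cluster name
                    .node();
            return node.client();
        }

        // External: a lightweight client that speaks the native transport
        // protocol to remote nodes.
        static Client external() {
            return new TransportClient().addTransportAddress(
                    new InetSocketTransportAddress("es1.example.com", 9300));
        }
    }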

We have been using Bagheera as the place for our business logic hooks, caching, and operational management.

I don't think the service is yet ready for people to deploy without being
involved in its development, but I am very happy with our results so far,
and I believe it will soon be generally useful for other projects.

If it sounds like something you would like to play with or contribute to, drop us a line.

Daniel Einspanjer
Metrics Engineering Manager
Mozilla Corporation

FWIW, we (tracermedia.com) have been fronting all of our requests to
ElasticSearch with Hazelcast. Hazelcast is as distributed as ES, and at its
most basic it provides a set of distributed collections. Some of the cooler
features are distributed locks, semaphores, topics, and queues. We are using
it as a high-performance, distributed memcached layer. We also leverage it
for the simple transaction support that ES lacks (this is not 2PC), and some
other uses I will mention.

Our applications are written using SSJS (not Node.js, but RingoJS, for the
full Java ecosystem support), and these applications talk only to Hazelcast,
which is embedded alongside ES as part of our web application stack. We
wrote a trivial Hazelcast <-> ElasticSearch bridge which allows Hazelcast to
push and pull information from ES.

Hazelcast provides the hooks we need to support integration. We can have
components listening to Hazelcast queues for tasks that need to be executed
by only one node in a cluster, or listening to topics to be notified of a
system event. It isn't a full blown ESB, but it is very good when you need
something in between.
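
The bridge really is trivial; a sketch of what its queue-to-ES direction can
look like, against the Hazelcast 1.x-era static API (queue, index, and type
names are made up):

    import java.util.concurrent.BlockingQueue;

    import com.hazelcast.core.Hazelcast;
    import org.elasticsearch.client.Client;

    public class IndexBridge implements Runnable {
        private final Client es; // an ES Client, embedded node or transport

        // IQueue implements BlockingQueue, and every Hazelcast member sees
        // the same "es-index" queue (made-up name; 1.x-era static API).
        private final BlockingQueue<String> queue = Hazelcast.getQueue("es-index");

        public IndexBridge(Client es) {
            this.es = es;
        }

        public void run() {
            try {
                while (true) {
                    // take() blocks until some node offers a document; each
                    // item is delivered to exactly one consumer in the cluster.
                    String json = queue.take();
                    es.prepareIndex("docs", "doc") // made-up index/type
                            .setSource(json)
                            .execute().actionGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }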

Heh... I didn't mention it in my description, but Bagheera makes significant
use of Hazelcast to support its glue-like functionality.

Are there any annotation-based ways to do persistence with ES in Java?

I've only seen a few dead projects.