Question about ES w/ Couch (or any other DB)


(Roger Studner) #1

I'm probably missing the obvious, but what are the main/agreed-upon
reasons for using something like CouchDB to actually store "the data"
and having the River make the indexes in ES?

I.e., why not use ES all by itself if in the end you are going to index
with it? What is the great advantage of the hybrid solution? (I
understand that inserts are slow to anything Lucene-based, but if you
have the River synchronization from Couch -> ES, isn't that still
taking about the same time?)

Thanks for any insight that can be provided

Best,
Roger


(Huy Phan) #2

Hi Roger,

Actually, ES just "listens" to the changes in CouchDB and imports the
data into itself. That means that in the end ES still stores all the
data and doesn't use CouchDB as its actual backend storage, as you
suggested. To test this, have ES index the data from CouchDB, then shut
down CouchDB and search in ES again: you can still see your documents
under the _source field.

That said, I would still vote for any proposal that has ES store only
the index data and leave the original documents in CouchDB/MongoDB or
any other NoSQL store. The reason is to avoid data duplication: if your
NoSQL store is already in your system for other purposes and cannot be
replaced, you may want to make use of it instead of copying its whole
data set into ES.
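The mechanism Huy describes — ES tailing CouchDB's _changes feed and
mirroring every document (including its body, kept as _source) into its
own storage — can be sketched as a toy in-memory model. This is
illustrative only, not the river's actual code; the name apply_change
and the dict-based "index" are made up for the sketch:

```python
# Toy model of the CouchDB -> ES river: ES consumes the _changes feed
# and mirrors every document into its own store, so searches keep
# working even if CouchDB later goes away.

def apply_change(es_index, change):
    """Apply one CouchDB _changes entry (with include_docs=true) to a
    dict standing in for the ES index. Deletions are propagated too."""
    doc_id = change["id"]
    if change.get("deleted"):
        es_index.pop(doc_id, None)
    else:
        # ES keeps the full document body as the _source field.
        es_index[doc_id] = {"_source": change["doc"]}

# A fake _changes feed: two inserts, then a deletion.
changes = [
    {"id": "a", "doc": {"title": "first"}},
    {"id": "b", "doc": {"title": "second"}},
    {"id": "a", "deleted": True},
]

es_index = {}
for change in changes:
    apply_change(es_index, change)

# "Shut down CouchDB": the feed stops, but ES still holds the data.
print(es_index)  # {'b': {'_source': {'title': 'second'}}}
```

This is exactly why Huy's shutdown test works: after the copy, ES no
longer needs CouchDB to answer queries.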

--huy



(ppearcy) #3

We were a relatively early Elasticsearch adopter and began an in-depth
eval comparing it to Solr about a year ago (which even back then ES won
hands down), with a final production push around half a year ago.

There were some issues in earlier versions of ES that could lead to
data loss, but both of the cases I know of have been fixed:

  • Network partitions leading to data loss - fixed in 0.16
  • Running out of disk space leading to corruption - fixed in Lucene a
    long time ago

I don't know of any current issues that could lead to data loss;
still, I get a warm fuzzy feeling from having a main data store and
syncing it out to Elasticsearch. Storage is cheap and losing data can
be expensive.

Huy makes a good point as well. Our main data store has been around
for 10+ years and is deeply entrenched in our infrastructure;
Elasticsearch is tacked on for flexible and extremely fast searching.

I would say that if you are using ES as your main store, you're living
a little dangerously. Last I heard Shay did not recommend ES as the
primary data store, but that may have changed.

Best Regards,
Paul


(Karel Minarik) #4

> I would say that if you are using ES as your main store, you're living
> a little dangerously. Last I heard Shay did not recommend ES as the
> primary data store, but that may have changed.

Yes, precisely. We are keeping our data in CouchDB for backup/recovery
purposes. The application retrieves data from ES only.

Best,

Karel


(Clinton Gormley) #5

> There were some issues in earlier versions of ES that could lead to
> data loss, but both of the cases I know of have been fixed:
>
>   • Network partitions leading to data loss - fixed in 0.16

What fixes are you referring to in 0.16?

As far as I'm aware, if you have (eg) a 2 node cluster, and they stop
talking to each other, you end up with two masters. Inserting data into
either one of them will mean that that data isn't in the other node.

These nodes will not be able to join together without restarting one of
them, at which stage, any data that only that node has will be lost.

Or has the situation changed?
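The two-node scenario Clinton describes can be played out as a toy
model (plain Python, nothing ES-specific; the dicts merely stand in
for each node's copy of the index): both halves keep accepting writes
during the partition, and re-forming the cluster by restarting one
node discards that node's un-replicated documents.

```python
# Toy model of a two-node split-brain: each side elects itself master
# and keeps accepting writes; recovery by restarting one node discards
# its un-replicated documents.

node_a = {"doc1": "indexed before partition"}
node_b = dict(node_a)          # replica in sync before the partition

# Partition: clients keep writing, each node sees only its own writes.
node_a["doc2"] = "written to A during partition"
node_b["doc3"] = "written to B during partition"

# "Fix" the cluster by restarting B, which then recovers from A.
node_b = dict(node_a)

print("doc3" in node_b)  # False - B's partition-era write is lost
```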

> I don't know of any current issues that could lead to data loss;
> still, I get a warm fuzzy feeling from having a main data store and
> syncing it out to Elasticsearch. Storage is cheap and losing data can
> be expensive.

Agreed.

The other point is that ES is near real time. You index a doc, and it
doesn't become visible for up to 1 second. While you can work around
this in your application, most DBs don't work this way, so it adds a
layer of complexity.
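The near-real-time behaviour clint describes can be made concrete with
a toy model: newly indexed docs sit in a buffer and only become
searchable after a refresh (which ES performs periodically, by default
about every second; the class and method names here are invented for
the sketch):

```python
# Toy model of near-real-time search: writes land in a buffer and are
# only searchable after the next refresh.

class NRTIndex:
    def __init__(self):
        self.buffer = {}      # indexed but not yet visible
        self.visible = {}     # searchable segment

    def index(self, doc_id, doc):
        self.buffer[doc_id] = doc

    def refresh(self):
        # In ES this happens automatically on a short interval.
        self.visible.update(self.buffer)
        self.buffer.clear()

    def search(self, doc_id):
        return self.visible.get(doc_id)

idx = NRTIndex()
idx.index("42", {"body": "hello"})
print(idx.search("42"))   # None - not visible yet
idx.refresh()
print(idx.search("42"))   # {'body': 'hello'} - visible after refresh
```

Most databases make a write visible to the writer immediately, which
is the mismatch that adds the complexity clint mentions.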

clint


(ppearcy) #6

Hey Clinton,
The issue I am referring to that led to data loss was discussed here:
http://elasticsearch-users.115913.n3.nabble.com/ES0-15-2-network-partition-can-lead-to-data-loss-td2663079.html

I'm not sure of the exact update that fixed it, but it did go into
0.16.0 and I spent a while verifying it. This was more of a
self-destruct sequence where complete shards were just nuked.

Agreed that a network partition can still lead to data loss for docs
indexed after the partition occurs. This is another great reason to
have a backing data store. We haven't hit this case, but we are
provisioned for it, as we can replay transactions from the primary
data store into ES.

Best Regards,
Paul



(Clinton Gormley) #7

Hi Paul

> The issue I am referring to that led to data loss was discussed here:
> http://elasticsearch-users.115913.n3.nabble.com/ES0-15-2-network-partition-can-lead-to-data-loss-td2663079.html
>
> I'm not sure of the exact update that fixed it, but it did go into
> 0.16.0 and I spent a while verifying it. This was more of a
> self-destruct sequence where complete shards were just nuked.

Ah yes - I remember this.

> Agreed that a network partition can still lead to data loss for docs
> indexed after the partition occurs. This is another great reason to
> have a backing data store. We haven't hit this case, but we are
> provisioned for it, as we can replay transactions from the primary
> data store into ES.

Yeah, me too. I must say that in recent releases I have found ES to be
very stable and reliable indeed (at least when configured to avoid
using swap, and hosted on nodes with sufficient memory, etc.).

I've been using ES since version 0.4 and have a 'recovery' script
which compares the data I have in ES to the data in my DB and fixes
anything that is missing. I used to have to run it frequently, and it
saved my bacon many times.
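A compare-and-repair script of the kind clint describes boils down to
a set difference between the primary store's document ids and those in
ES, reindexing whatever is missing. A minimal sketch with in-memory
stand-ins (the repair function and sample data are invented; a real
version would scan both stores and bulk-index the gaps):

```python
# Sketch of a recovery script: find documents present in the primary
# data store but missing from ES, and reindex them.

def repair(db, es_index):
    """Reindex docs missing from ES; return the ids that were fixed."""
    missing = set(db) - set(es_index)
    for doc_id in missing:
        es_index[doc_id] = db[doc_id]   # copy from the source of truth
    return sorted(missing)

db = {"1": {"t": "a"}, "2": {"t": "b"}, "3": {"t": "c"}}
es = {"1": {"t": "a"}}                  # ES lost two docs

print(repair(db, es))  # ['2', '3']
print(len(es))         # 3 - ES is whole again
```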

These days I run it, but it never needs to repair anything - all the
data is there. Even when we had OOM problems caused by 0.16.0, ES didn't
lose any of my data.

Awesome job!

clint
