May I use ES as DB to replace MongoDB?


(谢乐冰) #1

Not a joke.

We have an event log (userid, timestamp, action, entity, ...) that records
players' essential activities and is used for customer service. The volume
is around 10-15 million rows a day, and the data is held for 3 months. The
search conditions can be complicated, such as userid + time range +
activities, time range + activities, and so on.

Currently 3 solutions are being considered:

  1. Use a MongoDB cluster to hold the data.
  2. Use ES to index the log and for searching. Easy to set up and maintain.
  3. Use HBase, but we would have to create multiple "indexes".

Any ideas about that? Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAGJwNY4JYTXUhdBJ_d77iOHSf%2Bo%3Djs3p0h1%2Bx7URpqKBAe5e6w%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Eugene Strokin) #2

It seems like you don't really need search, just filtering, so you'd
use a subset of Elasticsearch's features. But why do you think you
cannot use ES as a DB? What is your concern?
Just so you know, I have been using ES as the only storage for one of my
projects for the second year now, in a big-data, high-traffic application.
If you do things right, you should be all right as well.
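
To make the "just filtering" point concrete, here is a minimal sketch of a filter-only request body for the userid + time range + activities case from the original post. The field names come from that post; the helper name build_event_filter is hypothetical, and the bool/filter layout is the query DSL of later Elasticsearch releases (2.0 and up). On the versions current when this thread was written, the rough equivalent would be a filtered query wrapping the same term/range/terms filters, so treat the exact shape as an assumption.

```python
from datetime import datetime, timezone


def build_event_filter(user_id=None, start=None, end=None, actions=None):
    """Build a filter-only search body for the event log.

    Field names (userid, timestamp, action) follow the original post.
    Every argument is optional, so the same helper covers both
    "userid + time range + activities" and "time range + activities".
    """
    filters = []
    if user_id is not None:
        # Exact match on the user, no scoring involved.
        filters.append({"term": {"userid": user_id}})
    if start is not None or end is not None:
        bounds = {}
        if start is not None:
            bounds["gte"] = start.isoformat()
        if end is not None:
            bounds["lt"] = end.isoformat()
        filters.append({"range": {"timestamp": bounds}})
    if actions:
        # Match any of the listed activities.
        filters.append({"terms": {"action": list(actions)}})
    return {"query": {"bool": {"filter": filters}}}


# Example: one player's logins and purchases during January 2014.
body = build_event_filter(
    user_id=42,
    start=datetime(2014, 1, 1, tzinfo=timezone.utc),
    end=datetime(2014, 2, 1, tzinfo=timezone.utc),
    actions=["login", "purchase"],
)
```

The resulting dict would be sent as the body of a search request; how it is sent (HTTP client, official client library) is left out on purpose.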

Eugene



(amos.wood) #3

For one of our projects, we also use Elasticsearch as the sole database. The
only consideration is that while gets by id are real-time, all other
searches are subject to the "refresh interval" setting of the particular
index/table. We worked around this by:

  1. Setting the refresh_interval to 25ms.
  2. After a write to our service, pausing for 25ms before returning a
    successful write to the client.
  3. Putting an automatic retry mechanism on particular calls. This helped when
    the index servers were under heavy traffic and the refresh actually
    took more than 25ms. This scenario happened when a client wrote a record
    and immediately wanted to get it by a field other than its id.
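
The retry in step 3 could look roughly like the sketch below. It is not the code from that project: search_with_retry and its parameters are hypothetical names, and search stands in for whatever call queries by a non-id field and returns the matching document ids.

```python
import time


def search_with_retry(search, expected_id, attempts=5, delay_s=0.025):
    """Repeat a search until a just-written document becomes visible.

    `search` is a hypothetical zero-argument callable returning a list of
    document ids. `delay_s` defaults to 25ms, matching the refresh interval
    mentioned above.
    """
    for attempt in range(attempts):
        ids = search()
        if expected_id in ids:
            return ids
        if attempt < attempts - 1:
            # Wait roughly one refresh interval before asking again.
            time.sleep(delay_s)
    raise TimeoutError(f"{expected_id!r} not visible after {attempts} attempts")
```

Capping the number of attempts keeps a slow refresh from turning into an unbounded wait; the caller decides what to do when the TimeoutError is raised.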



(davrob) #4

From my understanding, which admittedly is limited, there is still
potential to lose data with Elasticsearch.

Even with the new Snapshot API running regularly, if all indexes get
corrupted, there is no guarantee of a 100% backup and restore, because
you would lose whatever was written to the index after your last snapshot.



(Eugene Strokin) #5

You are correct. But how is this different from any other DB?
I guess the question is more like: if I'm running ES under normal conditions, could an index get corrupted?
If it is a hardware issue and you have replication switched on, then you wouldn't be affected much. Your system will continue functioning, but the cluster state will become yellow. You'd need to replace the node, and that's it.
Some people have claimed that they experienced sudden index corruption with data loss. I myself have never seen anything like this. I have done stupid things a few times and had near-heart-attack moments, but the data wasn't lost in the end, and again I had nothing to blame but myself.

Regarding stability, I can say that ES has not given us any problems. I have successfully done the following in a production environment with zero downtime:

  • adding nodes and replication
  • transitioning data to another data center
  • adding more clients
    etc.

I'd really like to hear from people who have experienced data loss. If someone provided details, it would help us understand what went wrong and what we should avoid doing.
But besides claims that such cases exist, I haven't heard anything else.

Eugene



(lebowitz) #6

I was talking about using ES as a system of record with my friendly IT director today. We were brainstorming about how "backup" would work.

Lucene index segments are immutable, so we can think about ES data as a transaction log. We can recreate the data from _source as of a given time by taking a scan/scroll archive of docs at an interval, e.g. every hour. This is much the same as backing up DB transaction logs.
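
A minimal sketch of the archive idea, assuming the documents have already been pulled out of ES (for example by a scan/scroll pass, which is not shown here). The function names and the one-JSON-object-per-line file layout are assumptions made for illustration, not an established backup format.

```python
import io
import json


def archive_docs(docs, out):
    """Write each hit's _id and _source as one JSON line; return the count.

    `docs` is any iterable of dicts carrying "_id" and "_source" keys,
    which is the shape of a search hit.
    """
    count = 0
    for doc in docs:
        record = {"_id": doc["_id"], "_source": doc["_source"]}
        out.write(json.dumps(record, sort_keys=True))
        out.write("\n")
        count += 1
    return count


def restore_docs(inp):
    """Yield (doc_id, source) pairs from an archive written by archive_docs."""
    for line in inp:
        if line.strip():
            record = json.loads(line)
            yield record["_id"], record["_source"]


# Example round trip through an in-memory buffer instead of a real file.
hits = [
    {"_id": "1", "_source": {"userid": 42, "action": "login"}},
    {"_id": "2", "_source": {"userid": 7, "action": "purchase"}},
]
buffer = io.StringIO()
written = archive_docs(hits, buffer)
buffer.seek(0)
restored = list(restore_docs(buffer))
```

Each restored pair could then be re-indexed under its original id. Running the archive once per interval gives a series of files that can be replayed in order.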


(davrob) #7

Hi Eugene,

Thanks for your comments. I'll do my best to explain where I am coming
from and to address some of the issues you have raised.

Firstly, where I'm coming from: the data I'm holding and searching against
needs to be 100% backed up because it needs to be audited in the future.
For that reason the data is held in an old-fashioned multi-master
replicated relational DB.

In terms of the issues you raised:

  1. But how is this different from any other DB?

i) With relational DBs, replaying the transaction logs to recover any data
that hasn't been backed up is part of the strategy. I've heard of people
doing this with ES, but it is not documented well anywhere. Additionally,
the transaction logs, to my limited understanding, are kept in the same
area as the index files and can suffer corruption. I think there may be
some monitoring in version 1.0 to stop ES writing to disk before the files
become corrupted, which would help. But the first point stands: there is
no clear transaction log replay strategy outlined for Elasticsearch.

ii) Multi-master replication: no doubt it's possible to arrange JMS queues
or Hazelcast/Coherence grids to do this, but a built-in solution would be
useful.

  2. Examples of data loss: when upgrading Elasticsearch versions, I've ended
    up losing all data, no doubt through my own fault. Maybe I'd have been
    more careful, and read the upgrade instructions more carefully, if I had
    known that my data was not backed up in the relational database. But it
    is definitely something that plays on my mind: "If I screw up this
    upgrade process, or misunderstand it, then that's it, my data is gone."

So I would probably add the following, although I could be wrong, because
I have not read every blog post relating to ES upgrades:

  1. But how is this different from any other DB?

iii) There is no clear, consistent, well-documented process for upgrading
Elasticsearch versions, particularly when the underlying Lucene version
changes.

David.



(lebowitz) #8

I was talking about using ES as a system of record with my friendly IT director today. We were brainstorming about how "backup" would work.

Lucene index segments are immutable, so we can think about ES data as a transaction log. We can recreate the data from _source as of a given time by taking a scan/scroll archive of docs at an interval, e.g. every hour. This is much the same as backing up DB transaction logs.



(Jörg Prante) #9

Note, there is a valuable snapshot/restore facility coming in ES 1.0.0,
with incremental snapshots.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/modules-snapshots.html

Jörg


