ElasticSearch vs NoSQL


(Gísli Kristjánsson) #1

I must begin by praising the effort. I have been in NoSQL research
mode for the last few days and I'm still discovering cool new
projects. From what I've seen on the official website (screencast,
docs, and forum), ElasticSearch is definitely one of the more
impressive projects so far (and I only discovered it a couple of
hours ago).

As I'm now comfortable with the CAP Theorem I think I'm getting ready
to pick my cherry from the myriad of NoSQL options. The top contenders
for me at the moment are MongoDB and Riak. I lean towards Riak (from a
design and implementation perspective) but MongoDB's query language
seems very powerful.

After reading the NoSQL, Yes Search post
(http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html) I
concluded that Riak, with search provided by ElasticSearch, might be the
perfect combination (as described in the blog entry).

After a while I started asking myself: why do I need to use
ElasticSearch as a supplement to another NoSQL implementation, since
the whole object seems to be stored within ElasticSearch? This seems
to be further supported by the introduction of binary attachments
(http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/f0a26efd88365bad#).
Am I missing something here, or is ElasticSearch only meant to be used
in conjunction with another datastore?


(Sergio Bossa) #2

2010/4/5 Gísli Kristjánsson gislik@hamstur.is:

After reading the NoSQL, Yes Search post
(http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html) I
concluded that Riak, with search provided by ElasticSearch, might be the
perfect combination (as described in the blog entry).

Hi Gisli,

the "NoSQL, Yes Search" blog post mentions that ElasticSearch has
already been integrated with the Terrastore NoSQL store, so I'd like
to know why you think Riak would be a better fit/choice: your feedback
will help us improve the integration and understand what's wrong.

Thanks for sharing your thoughts,
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob


(timrobertson100) #3

I am also interested in the response to the original question.
With ES storing the JSON document, it seems a kludgy integration, since
the doc is stored twice over (I'm interested in HBase because of
MapReduce support for other needs). Would a better integration be to
let ES handle all indexing, store only the DocID in the index, and
hook up the datastore so that ES delegates all GetByKey calls to the
underlying storage system?
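
To make the suggestion concrete, here is a minimal sketch of that delegation in Python. Everything in it is hypothetical: `PrimaryStore`, `IdOnlyIndex` and `SearchFacade` are illustrative stand-ins, not ES or HBase APIs. The index keeps only terms and DocIDs, and every get-by-key goes to the underlying store.

```python
from collections import defaultdict


class PrimaryStore:
    """Stand-in for the real datastore (HBase, Riak, ...): a plain key/value map."""

    def __init__(self):
        self._rows = {}

    def put(self, doc_id, doc):
        self._rows[doc_id] = doc

    def get(self, doc_id):
        return self._rows.get(doc_id)


class IdOnlyIndex:
    """Toy inverted index that stores DocIDs only, never the document body."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index(self, doc_id, text):
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, term):
        return sorted(self._postings.get(term.lower(), set()))


class SearchFacade:
    """Indexes on write; delegates every get-by-key to the primary store."""

    def __init__(self, store, index):
        self.store = store
        self.index = index

    def put(self, doc_id, doc):
        self.store.put(doc_id, doc)            # the only copy of the document
        self.index.index(doc_id, doc["body"])  # index keeps terms -> DocIDs

    def get(self, doc_id):
        return self.store.get(doc_id)          # GetByKey is delegated

    def search(self, term):
        return [self.get(doc_id) for doc_id in self.index.search(term)]


facade = SearchFacade(PrimaryStore(), IdOnlyIndex())
facade.put("1", {"body": "elastic search on hbase"})
facade.put("2", {"body": "riak and search"})
hits = facade.search("search")
```

The cost of this design is that every search hit needs a lookup in the primary store before it can be displayed.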


(Gísli Kristjánsson) #4

Hi Sergio (and Tim),

Terrastore is a very promising alternative, but the reasons I have for
preferring Riak are:

  • Riak is more mature
  • Documents
  • Community
  • (Enterprise) Support
  • I like the concept of the no-master setup in Riak
  • I program in Erlang, and a storage system on the same platform is
    a good feeling(TM)
  • MapReduce is a powerful way to transform/query data
I'll be keeping an eye on Terrastore though as it improves.

Back to my original question (as Tim and I both seem eagerly interested).
If ES is not intended to be the storage system, a solution like the one Tim
suggested is very interesting. Also, can someone explain to me the
difference between an index system (such as ES) that stores the data
(including binaries via the attachment plugin) and a storage system (like
Terrastore)?

Thanks,
Gísli


(Gísli Kristjánsson) #5

Also, as I see you're the author of Terrastore: I got the following
error when trying to start the server (after a successful master
startup) on my MacBook Pro:

MacBook-Pro:bin gislik$ sh start.sh --master localhost:9510
Starting Terrastore Server ...
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:676)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at sun.misc.Launcher$AppClassLoader.findClass(Launcher.java)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:317)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:375)


(Shay Banon) #6

Hi all,

It's a very interesting question, how to use elasticsearch within your
architecture. Let me first explain why elasticsearch stores the json source
(by default; it can be disabled in the next version). The idea is that when
you search, the search request is already executing right where the data is.
If you are already local to the data, it makes a lot of sense to also fetch
what needs to be displayed in the search results. If the _source field is
disabled, then for the N hits you get back you need to execute N fetch
requests (or a single batch, if multi-key fetch is supported) against your
data storage. If that's OK in terms of latency and overhead on the system,
then it's acceptable, assuming that storing the actual source json is a big
overhead within the index storage.
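
A rough sketch of that trade-off, in Python with made-up names (`ToyIndex` and `ToyStore` are illustrative only, not elasticsearch APIs). With the source stored in the index, a search needs no extra calls; with it disabled, the caller makes N extra fetches, or one if the store supports a multi-key get.

```python
class ToyStore:
    """Stand-in for the external data storage; counts the calls it receives."""

    def __init__(self, docs):
        self.docs = dict(docs)
        self.calls = 0

    def get(self, doc_id):
        self.calls += 1
        return self.docs[doc_id]

    def multi_get(self, doc_ids):
        self.calls += 1  # one batched round trip
        return [self.docs[d] for d in doc_ids]


class ToyIndex:
    """Stand-in for the search index; optionally keeps the JSON source."""

    def __init__(self, docs, store_source=True):
        self.ids = sorted(docs)
        self.sources = dict(docs) if store_source else None

    def search(self):
        # Match-all query: hits carry the source only if it is stored locally.
        if self.sources is not None:
            return [{"_id": i, "_source": self.sources[i]} for i in self.ids]
        return [{"_id": i} for i in self.ids]


docs = {"a": {"title": "one"}, "b": {"title": "two"}, "c": {"title": "three"}}

# Source stored in the index: no extra calls to the data storage.
store = ToyStore(docs)
hits = ToyIndex(docs, store_source=True).search()
calls_with_source = store.calls

# Source disabled: N single fetches ...
store = ToyStore(docs)
ids = [h["_id"] for h in ToyIndex(docs, store_source=False).search()]
fetched = [store.get(i) for i in ids]
calls_single = store.calls

# ... or one batch, if multi-key fetch is supported.
store = ToyStore(docs)
batched = store.multi_get(ids)
calls_batched = store.calls
```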

Also note that elasticsearch is a near real time search (though I hope to
get it to be real time some day). This means that if you index a document,
your search/get requests will see it after a certain interval (can be
configured).
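
As a toy model of that behaviour (a sketch only; `NearRealTimeIndex` and its `tick` method are invented for illustration and say nothing about how elasticsearch or Lucene implement refresh): indexed documents sit in a buffer and become visible once the configured interval has elapsed.

```python
class NearRealTimeIndex:
    """Toy model: indexed documents become visible only after a refresh."""

    def __init__(self, refresh_interval=1.0):
        self.refresh_interval = refresh_interval  # seconds, configurable
        self._pending = {}
        self._visible = {}
        self._last_refresh = 0.0

    def index(self, doc_id, doc):
        self._pending[doc_id] = doc  # buffered, not yet visible

    def tick(self, now):
        """Advance the clock; refresh if the interval has elapsed."""
        if now - self._last_refresh >= self.refresh_interval:
            self._visible.update(self._pending)
            self._pending.clear()
            self._last_refresh = now

    def get(self, doc_id):
        return self._visible.get(doc_id)


idx = NearRealTimeIndex(refresh_interval=1.0)
idx.index("1", {"user": "gisli"})
idx.tick(now=0.5)
before = idx.get("1")   # interval not elapsed yet, so nothing is visible
idx.tick(now=1.0)
after = idx.get("1")    # visible after the refresh
```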

As to the question of whether elasticsearch can act as the single nosql
solution of choice, namely the main storage of your data: it depends. First,
note that elasticsearch is not a 1.0 version (I consider it a strong beta;
some sites are about to go live with it any day now), so I would consider
not using it as the main data storage currently. The simple reason is that
with the data kept elsewhere, if something goes really bad, you can always
reindex it.

I have worked on and been involved with several projects that actually used
Lucene as the main storage system of applications, and they were happy with
it. Will elasticsearch become a possible main data storage? It depends. If
what it provides fits the bill, and it goes GA, then go with it. If not (you
need versioning, transactionality), then it can certainly be a complementary
solution to your nosql of choice.

If you do decide to go with Riak, then elasticsearch is certainly a good
choice here, as it gives you the ability to have a very rich query model and
search on top of your data. As a side note, I know Riak are working on a
search engine. Not sure when it is going to come out. But, as skilled as the
people on riak land are (and they really are), I doubt that they can easily
build something that can compare to the richness of elasticsearch (and
Lucene under it).

No matter which solution you choose to go with, I would love to cooperate
on getting some sort of a plugin built into elasticsearch to automatically
index the nosql you work with. Unless, of course, you go with terrastore,
which has it built in :).

cheers,
shay.banon


(Sergio Bossa) #7

2010/4/5 Gísli Kristjánsson gislik@hamstur.is:

Terrastore is a very promising alternative, but the reasons I have for
preferring Riak are:

  • Riak is more mature
  • Documents
  • Community
  • (Enterprise) Support
  • I like the concept of the no-master setup in Riak
  • I program in Erlang, and a storage system on the same platform is
    a good feeling(TM)
  • MapReduce is a powerful way to transform/query data

I'll be keeping an eye on Terrastore though as it improves.

Got it, you have absolutely valid reasons.
Just to be clear, I didn't want to endorse Terrastore, only to know the
reason for your choice: Riak is great, and if it fits your needs better
than others, just go with it :wink:

--
Sergio Bossa
http://www.linkedin.com/in/sergiob


(Sergio Bossa) #8

2010/4/5 Gísli Kristjánsson gislik@hamstur.is:

Also as I see you're the author of Terrastore I got the following
error when trying to start the server (after a successful master
startup) on my MacBook Pro:

It seems to be a problem with your JDK version. Do you mind moving your
question to the Terrastore mailing list? It's off-topic here and I
don't want to annoy ElasticSearch users :wink:

Thanks!

--
Sergio Bossa
http://www.linkedin.com/in/sergiob


(Sergio Bossa) #9

On Mon, Apr 5, 2010 at 1:33 PM, Tim Robertson timrobertson100@gmail.com wrote:

Would a better integration be to
allow ES handle all indexing, only store the DocID in the index, and
hook up the datastore so that ES delegates all GetByKey to the
underlying storage system?

There's an issue about that, feel free to comment on:


I agree it would be great, maybe I'll find some time to contribute
some code to the already amazing work made by Shay :wink:

--
Sergio Bossa
http://www.linkedin.com/in/sergiob


(Berkay Mollamustafaoglu-2) #10

"Can someone explain the difference to me between an index system (such as
ES) that stores the data (including binaries via attachment plugin) and a
storage system (like Terrastore)?"

This is a question that's been on my mind for some time as well. (Shay
provided his take on it as I was writing this.) It is clear what ES brings
to the table when used alongside a nosql solution. It is harder to pin
down what document stores provide that ES does not. I suspect the answer is
different for different nosql solutions. It'd be great to hear from users of
the various nosql solutions as they get familiar with ES. Shay already
pointed out a couple of areas where ES may not be suitable: transactionality
and being near real time (as opposed to real time). However, most nosql
solutions don't have transaction support either. I'm looking forward to
getting educated on what else document stores bring to the table :slight_smile:

Also, as Shay warns, ES is new and not GA, but it leverages mature libraries,
which is helpful. The fact that it uses Lucene (a mature library) as the
data store is a great comfort, and may be preferable to the relatively
untested nature of nosql stores.

Regards,
Berkay Mollamustafaoglu
http://www.ifountain.com
Ph: +1 (571) 766-6292
mberkay on yahoo, google and skype


(timrobertson100) #11

It is harder to pin down what document stores provide that ES does not

With HBase, a huge advantage is the rest of the Hadoop family of products.

  • Hive (a SQL "engine" from Facebook) gives
    • the ability to run "reports" such as counts with group by's on huge data
    • ability to do huge joins easily (I did 200million to 200 million
      producing >1Billion in under 10 mins)
    • can run on delimited files (e.g. CSVs)
    • Hive has an HBase input format
  • MapReduce from Hadoop

There are a few indexing options popping up on HBase, which led me to
search and land on this mailing list. Some investigation shows ES is
a nice candidate to offer the search capabilities missing natively in
HBase, so I am pondering some integration.


(Gísli Kristjánsson) #12

This is now an open issue on Terrastore's Google Code :slight_smile:

On Apr 5, 1:59 pm, Sergio Bossa sergio.bo...@gmail.com wrote:

2010/4/5 Gísli Kristjánsson gis...@hamstur.is:

Also as I see you're the author of Terrastore I got the following
error when trying to start the server (after a successful master
startup) on my MacBook Pro:

It seems a problem with your JDK version: do you mind moving your
question to the Terrastore mailing list, it's off-topic here and I
don't want to annoy ElasticSearch users :wink:

Thanks!

--
Sergio Bossa
http://www.linkedin.com/in/sergiob


(Sergio Bossa) #13

On Mon, Apr 5, 2010 at 4:16 PM, Berkay Mollamustafaoglu
mberkay@gmail.com wrote:

It is clear what ES brings
to the table when used along side with a nosql solution. It is harder to pin
down what document stores provide that ES does not.

There are certainly a few things that ElasticSearch doesn't
(currently) offer as a storage solution, more specifically:

  1. Real-time storage: storing and getting back data depends on near
    real time Lucene capabilities.
  2. Durability: indexes aren't durable across node restarts, unless you
    configure a gateway whose persistence is, however, snapshot based, so
    you may lose the latest data (AFAIU, please correct me if wrong).
  3. Performance: Lucene isn't intended as a storage solution; it may or
    may not work for your needs, but again, that's not the intended use
    (and in my own experience, it doesn't work).

In other words, in order to be a complete storage and indexing
solution on its own, ElasticSearch should IMHO offer separate storage
for its documents, maybe something like an embedded Java Berkeley DB,
but it's not that easy and I don't know if it makes sense to provide
from scratch what other (SQL/NoSQL) solutions already do ... but Shay
has absolutely the last word on that :wink:

Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob


(Shay Banon) #14

Hi,

Let me first address point 2, as it's core elasticsearch. ElasticSearch
does provide durability. Snapshots to the gateway are interval based,
but everything (indices and a transaction log) is maintained across shard
replicas. This means that if a node fails, the replicas will make sure
everything is snapshotted to the gateway properly. This is how write-behind
works in most data grid vendors (Coherence, GigaSpaces).
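
A toy simulation of that write-behind idea, in Python. The `Shard` and `Gateway` classes and the `index` function are invented for this sketch and are not elasticsearch classes; the real mechanics are more involved. Writes go to every shard copy, the gateway only sees periodic snapshots, and a surviving replica can still snapshot what the failed node had not yet persisted.

```python
class Shard:
    """Toy shard copy: holds documents plus a transaction log of operations."""

    def __init__(self):
        self.docs = {}
        self.translog = []

    def apply(self, op):
        self.translog.append(op)
        self.docs[op["id"]] = op["doc"]


class Gateway:
    """Toy gateway: long-term storage that only receives periodic snapshots."""

    def __init__(self):
        self.docs = {}

    def snapshot(self, shard):
        self.docs = dict(shard.docs)


def index(op, copies):
    """Apply the operation to the primary and to every replica."""
    for shard in copies:
        shard.apply(op)


primary, replica = Shard(), Shard()
gateway = Gateway()

index({"id": "1", "doc": "first"}, [primary, replica])
gateway.snapshot(primary)                         # interval-based snapshot
index({"id": "2", "doc": "second"}, [primary, replica])
in_gateway_before_failure = sorted(gateway.docs)  # doc "2" not snapshotted yet

primary = None                                    # node fails before the next snapshot
gateway.snapshot(replica)                         # the surviving replica snapshots
in_gateway_after_failover = sorted(gateway.docs)
```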

Performance-wise, well, it all depends on in-memory caching, as anybody
who has used Berkeley DB when not all of its btree nodes fit in memory
knows :). Currently, Lucene should be as fast as Berkeley DB assuming data
resides on disk (and faster on SSDs, and faster yet with in-memory storage,
thanks to how it works). One thing that I plan to add is caching on other
levels than just the query filter cache, but to be honest, most times it's
not really needed...

ElasticSearch by no means aims to replace other nosql solutions. It's going
to evolve and provide its own features. If they fit the bill, great. If
not, elasticsearch is going to integrate well with most nosql solutions out
there. How well? I really hope that by 1.0, elasticsearch will be able to
automatically index data in most common nosql solutions (with the help of
the community; I will write the first one :wink: ).

cheers,
shay.banon


(alexandre gerlic) #15

2010/4/6 Shay Banon shay.banon@elasticsearch.com:

Hi,
Let me first address point 2, as it's core elasticsearch. ElasticSearch
does provide durability. Snapshots to the gateway are interval based,
but everything (indices and a transaction log) is maintained across shard
replicas. This means that if a node fails, the replicas will make sure
everything is snapshotted to the gateway properly. This is how write-behind
works in most data grid vendors (Coherence, GigaSpaces).

Hi,

To integrate Cassandra and ES, I am currently working along these lines:

  • put/remove on Cassandra will call ES via the Java API (same behavior as
    in the blog post "NoSQL, Yes Search")
  • create CassandraGateway and CassandraIndexGateway
  • gateway_index_snapshot disabled
  • gateway_index_recover created from Cassandra : create Translog (only
    CREATE instructions) from Cassandra
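
The flow in the list above can be sketched roughly as follows, in Python rather than Java for brevity. All names here (`ToySearchIndex`, `HookedStore`, `recover_translog`) are hypothetical stand-ins; this is not the Cassandra or ES API, only the shape of the idea: writes to the store also reach the index, and recovery replays CREATE operations built from the store's rows.

```python
class ToySearchIndex:
    """Stand-in for the ES side; applies translog-style operations."""

    def __init__(self):
        self.docs = {}

    def apply(self, op):
        if op["type"] == "CREATE":
            self.docs[op["id"]] = op["doc"]
        elif op["type"] == "DELETE":
            self.docs.pop(op["id"], None)


class HookedStore:
    """Stand-in for the Cassandra side; every put/remove also calls the index."""

    def __init__(self, index):
        self.rows = {}
        self.index = index

    def put(self, key, doc):
        self.rows[key] = doc
        self.index.apply({"type": "CREATE", "id": key, "doc": doc})

    def remove(self, key):
        self.rows.pop(key, None)
        self.index.apply({"type": "DELETE", "id": key})


def recover_translog(store):
    """Rebuild a translog holding only CREATE operations from the store's rows."""
    return [{"type": "CREATE", "id": k, "doc": d}
            for k, d in sorted(store.rows.items())]


index = ToySearchIndex()
store = HookedStore(index)
store.put("a", {"title": "first"})
store.put("b", {"title": "second"})
store.remove("a")

# Simulate losing the whole index, then recovering it from the store.
recovered = ToySearchIndex()
for op in recover_translog(store):
    recovered.apply(op)
```

Because the store holds the current state, only CREATE operations are needed: deleted rows are simply absent from the rebuilt log.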

Apart from the disabled _source issue, the aim is to avoid duplicating
data between the nosql solution and ES.
If the ES cluster crashes, I hope this solution will help me recreate the
ES cluster directly from the database instead of from the file system.

--
Alexandre Gerlic


(Shay Banon) #16

On Tue, Apr 6, 2010 at 2:47 AM, alexandre gerlic alexandre.gerlic@gmail.com wrote:

Hi,

To integrate Cassandra and ES, I am currently working along these lines:

  • put/remove on Cassandra will call ES via the Java API (same behavior as
    in the blog post "NoSQL, Yes Search")

Nice! I wonder about edge cases here, given how Cassandra works (I know it
in theory and partly from the code). Would love to see some code if you have it.

  • create CassandraGateway and CassandraIndexGateway
  • gateway_index_snapshot disabled
  • gateway_index_recover created from Cassandra : create Translog (only
    CREATE instructions) from Cassandra

I think that it would make sense to store the full index and the transaction
log on Cassandra. Rebuilding the index is not something that you would want
to do. Storing the index itself is a simple matter of simulating a file
system on top of the Cassandra API.
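
To give a feel for what "simulating a file system" over a key/value API could look like, here is a small Python sketch. `KeyValueDirectory` and its key layout are assumptions made up for this example; a plain dict stands in for the remote store, and nothing here is the Cassandra or Lucene API. Files are split into fixed-size blocks, with one metadata entry per file.

```python
class KeyValueDirectory:
    """Toy file system over a key/value map: files are split into blocks."""

    def __init__(self, kv=None, block_size=4):
        self.kv = {} if kv is None else kv  # stand-in for the remote store
        self.block_size = block_size

    def write_file(self, name, data):
        if f"{name}/meta" in self.kv:
            self.delete_file(name)          # drop stale blocks on overwrite
        blocks = [data[i:i + self.block_size]
                  for i in range(0, len(data), self.block_size)]
        for n, block in enumerate(blocks):
            self.kv[f"{name}/block/{n}"] = block
        self.kv[f"{name}/meta"] = {"length": len(data), "blocks": len(blocks)}

    def read_file(self, name):
        meta = self.kv[f"{name}/meta"]
        return b"".join(self.kv[f"{name}/block/{n}"]
                        for n in range(meta["blocks"]))

    def list_files(self):
        return sorted(k[:-len("/meta")] for k in self.kv if k.endswith("/meta"))

    def delete_file(self, name):
        meta = self.kv.pop(f"{name}/meta")
        for n in range(meta["blocks"]):
            del self.kv[f"{name}/block/{n}"]


shared_kv = {}
directory = KeyValueDirectory(shared_kv, block_size=4)
directory.write_file("segments_1", b"0123456789")

# A second node pointed at the same store sees the same files.
other_node = KeyValueDirectory(shared_kv, block_size=4)
content = other_node.read_file("segments_1")
files = other_node.list_files()
```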

Apart from the disabled _source issue, the aim is to avoid duplicating
data between the nosql solution and ES.

I have explained why I think storing the _source in elasticsearch still makes
sense. But of course, the option is there to disable it.

If the ES cluster crashes, I hope this solution will help me recreate the
ES cluster directly from the database instead of from the file system.

I think that if you store the index itself on Cassandra as well, then even if
the whole elasticsearch cluster crashes, you won't have to reindex the data.
That's the general idea.

--
Alexandre Gerlic


(Eks Dev) #17

I just started playing with ES and had to comment on this subject.

IMO, the question in the subject line (the discussion is great!) is plain
wrong. What I would like to see somewhere is rather search and "nosql db".
Keeping these two topics apart is like saying: OK, let us separate the DBMS
from indexing and SQL. Search is great, nosql DBs are great, but neither is
enough on its own.

"Traditional search" is just one application; useful, but just one
application. The more traditional, and much more general, computation model
is to have some way to locate data (old way "SQL", new way "search"),
retrieve data (old way "SQL", new way nosql KV stores), do something with
the data (SQL vs map-reduce today on mega-data) and put it back into
storage or deliver it outside.

What I am trying to say is that the "new way" has one missing link: it keeps
data in two completely separate worlds, technologically and logically apart
(think e.g. hbase and ES, or cassandra and solr). This is expensive, hard to
set up, hard to keep in sync, duplicates demand on resources ...

In an ideal world, imagine hbase where each node keeps an embedded lucene to
expose the search part, with all this magic Shay is doing with ES. This would
become one infrastructure to keep all players in sync, one set of APIs to
talk to clients... It seems riak is going this way.

I think I see this way of thinking behind ES, so imagine ES doing
map-reduce, keeping your data safe like hbase... :slight_smile:

Dreaming in public is, I guess, OK

Cheers,
Eks


(Kosta) #19

I'm glad this discussion took off, as it is something that I have been
pondering for a while now as well.

For my latest project I started off with a large tech stack... web
framework, database, message queue, distributed file system, search
index etc. Life was great, everything was modular and decoupled and it
was all going to fit into place beautifully. In the test environment I
set up half a dozen virtual machines, each running their own component
so that I have nice isolation and can easily pinpoint bottlenecks.

Unfortunately, marvelling over this grandiose architecture was
short-lived. It wasn't long before I started feeling the pain of keeping up
with the latest and greatest for each of these components, and of learning
their hidden pitfalls and secrets. Then I started thinking about
scaling out, and even though all these components were elastic, cloud-
ready and , each one had a different way of sharding
and replicating. So now I had to learn how to scale, monitor,
optimize, back up and configure 4 different technologies written in
different languages and having different dependencies. My head started
to hurt, it was time for a change of plan, for a new mantra -
sometimes simpler is better!

In my particular case, I used mongo as my nosql store and I was
definitely seeing a bit of an overlap between it and ES. The type of
data was simple and I didn't have a need for map-reduce operations or
complex set relations (otherwise I wouldn't be using a nosql solution
in the first place!), I just needed a flexible data model and a
fine-grained way to search & retrieve documents, which is what
ElasticSearch was made for in the first place. The fact that I could
partition and replicate my data using ElasticSearch, in a way
reminiscent of mongo, made the question of why I needed both even more
obvious.
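
A rough sketch of what that use case boils down to: schema-free documents, lookup by id, search by field, and hash partitioning. This is a toy Python illustration with hypothetical names (`ShardedDocs`, `index`, `get`, `search`), not the ES or mongo API. Replication is left out, and routing by `crc32(id) % num_shards` is only an assumed stand-in for whatever a real system does.

```python
import zlib


class ShardedDocs:
    """Toy schema-free document store split across hash-partitioned shards."""

    def __init__(self, num_shards=3):
        self._shards = [{} for _ in range(num_shards)]

    def _shard_for(self, doc_id):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        slot = zlib.crc32(doc_id.encode("utf-8")) % len(self._shards)
        return self._shards[slot]

    def index(self, doc_id, doc):
        # Documents are plain dicts; no schema is declared up front.
        self._shard_for(doc_id)[doc_id] = dict(doc)

    def get(self, doc_id):
        # Lookup by id touches exactly one shard.
        return self._shard_for(doc_id).get(doc_id)

    def search(self, **criteria):
        # Scatter the query to every shard and gather the matching ids.
        hits = []
        for shard in self._shards:
            for doc_id, doc in shard.items():
                if all(doc.get(k) == v for k, v in criteria.items()):
                    hits.append(doc_id)
        return sorted(hits)


docs = ShardedDocs()
docs.index("a", {"type": "post", "author": "kosta"})
docs.index("b", {"type": "post", "author": "eks", "tags": ["es"]})
docs.index("c", {"type": "comment", "author": "kosta"})
print(docs.search(author="kosta"))   # ['a', 'c']
print(docs.get("b")["tags"])         # ['es']
```

When one component covers all three operations, there is only one sharding scheme to learn, which is the simplification described above.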

So I took the plunge and decided to ditch mongo for the time being and
use ES as a primary form of storage. I asked around on groups and
forums and couldn't find any glaringly obvious problem with using
lucene as a storage engine. I also looked at Terrastore briefly but
couldn't really see from the architecture diagrams what it uses for
persistence. I assume Terracotta; but based on what I read so far,
terracotta is not really well suited for permanent data but rather
throw-away data. It was interesting to see Sergio mentioning under one
of his points that "Lucene isn't intended as a storage solution; it
may or may not work for your needs, but again, that's not the intended
use (and in my own experience, it doesn't work)". I think this is
worth analysing, with real use cases and war stories of particular
situations where lucene was not a good storage solution and didn't
work (and how Terrastore addresses and solves them).

TL;DR Large tech stacks can quickly turn into administrative/learning
nightmares. Sometimes the benefits of integrating multiple solutions
into one component can far outweigh the risks and problems, especially
in a case like this where many people are confused and already see an
overlap (i.e. using ES as a nosql store).


(Kosta) #20

Just realized that Eks replied to a year-old thread... Sorry for joining the thread resurrection like this, but I guess that makes it still somewhat relevant a year later :)