Integration with NoSQL

I just read this post:
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/e19eab5060d1bd55/246bc792b4415d69?lnk=gst&q=mongodb#246bc792b4415d69

and this article:
http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html

Yes, it would be great if storage and search were integrated. I'm
currently using MongoDB for storage and Solr for search (and research
on integrating the two led me to ElasticSearch). I have to implement
CRUD for both stores.

I'm not aware of an event mechanism in MongoDB for CRUD operations.
But I think one thing that might be useful for ES-to-MongoDB integration,
at least for a particular language, is to integrate through the language
drivers. This means the work would have to be done for each language,
but at least it's possible. Every language has an HTTP client library, so
hopefully adding a REST call to ES (or to other search engines, but
particularly ES because its HTTP/REST interface is so easy to use)
wherever data is changed in the NoSQL data store should not be too much
work. Just an idea :)
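
A minimal sketch of the driver-level idea, in Python. Everything named
here is hypothetical: `SearchSyncedCollection`, the dict standing in for
MongoDB, and the `/{index}/{type}/{id}` URL layout on `localhost:9200`
are assumptions for illustration, not an existing driver feature. The
request is only built, not sent, unless you pass a real sender.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) an HTTP PUT that indexes one document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


class SearchSyncedCollection:
    """Hypothetical wrapper: each write goes to the primary store first
    and is then mirrored to the search engine over HTTP."""

    def __init__(self, store, send, base_url="http://localhost:9200",
                 index="app", doc_type="doc"):
        self.store = store        # dict-like stand-in for the NoSQL store
        self.send = send          # callable that performs the HTTP request
        self.base_url = base_url  # assumed ES address
        self.index = index
        self.doc_type = doc_type

    def save(self, doc_id, document):
        # Write to the primary store, then mirror the same document to ES.
        self.store[doc_id] = document
        self.send(build_index_request(
            self.base_url, self.index, self.doc_type, doc_id, document))


if __name__ == "__main__":
    sent = []  # collect requests instead of sending them
    coll = SearchSyncedCollection(store={}, send=sent.append)
    coll.save("1", {"title": "hello"})
    print(sent[0].get_method(), sent[0].full_url)
```

To actually send the request, `send=urllib.request.urlopen` would work
against a running node. Note that the two writes are not atomic: if the
HTTP call fails after the store write, the two sides drift apart, so a
real integration would need retries or a reconciliation job.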

Jack

If you are like us, at some point you might wonder why you are using
MongoDB at all instead of just using ES as your datastore. :)

I've been thinking that :) Part of the problem is that with Solr, the search
engine I'm currently using, insert and delete are fine but update is a bit
awkward. It doesn't support field updates. Instead, you have to delete
a document and re-index it. I think it's the same with ES. (Please do
correct me if I'm wrong. I've only read a small portion of the docs.)

Another reason I'm using MongoDB is that there are ORMs available,
such as Morphia, which make accessing data from Java easier. An ORM
also sort of enforces a schema through the Entity classes while
still being easy to change, thanks to the flexibility MongoDB provides.
ORMs are generally not available for search engines, as far as I know.

But it's definitely encouraging to learn that some ES users are
successfully using it as their datastore!

A side question: does ES take full advantage of Lucene's text search
capability? For example, is there a way to specify a higher weight for
a title field than for a content field? Will a doc get a higher score if
query terms appear more often in it or are adjacent to each other?

Thanks,
Jack

On Sat, 2010-10-02 at 12:37 -0700, jlist9 wrote:

I've been thinking that :) Part of the problem is that with Solr, the search
engine I'm currently using, insert and delete are fine but update is a bit
awkward. It doesn't support field updates. Instead, you have to delete
a document and re-index it. I think it's the same with ES. (Please do
correct me if I'm wrong. I've only read a small portion of the docs.)

You don't have to delete it, you just "index" the doc again, and it will
take care of removing the old copy (as opposed to "create", which
won't check for existing versions).

But it's definitely encouraging to learn that some ES users are
successfully using it as their datastore!

kimchy, the developer, recommends against this at the moment, at least
until ES reaches 1.0 - Things are still changing, and recent releases
have changed the long term storage and temporary work storage structure
and required re-indexing, so you still need your data available
elsewhere.

A side question: does ES take full advantage of Lucene's text search
capability? For example, is there a way to specify a higher weight for
a title field than for a content field? Will a doc get a higher score if
query terms appear more often in it or are adjacent to each other?

Yes - it exposes all (most?) of Lucene's API via its query DSL, and adds
a few really nice features.

See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/
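
For the title-versus-content weighting question, here is a hedged
sketch of what such a request body could look like, built in Python.
It assumes a `query_string` query with `field^boost` entries in
`fields`; the exact query DSL shape differs between ES versions, so
treat the field names `title` and `content` and the helper
`boosted_query` as illustrative and check the query DSL docs for the
release you run.

```python
import json


def boosted_query(text, title_boost=2.0):
    """Return a search body that weights `title` matches above `content`."""
    return {
        "query": {
            "query_string": {
                "query": text,
                # "field^N" multiplies the score contribution of that field.
                "fields": [f"title^{title_boost}", "content"],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(boosted_query("distributed search"), indent=2))
```

The resulting JSON would be sent as the body of a search request.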

clint

Hi Clint,

Thanks for the reply.

You don't have to delete it, you just "index" the doc again, and it will
take care of removing the old copy.

Good to know this.

kimchy, the developer, recommends against this at the moment, at least
until ES reaches 1.0 - Things are still changing, and recent releases
have changed the long term storage and temporary work storage structure
and required re-indexing, so you still need your data available
elsewhere.

I see. Thanks for explaining. I'll keep a separate datastore for now.
But I guess the moral of the story is that everyone is hoping for an
integrated datastore and search engine :) Going back to my original
email, maybe ES will take off if language-driver-level integration
is implemented for popular NoSQL stores such as MongoDB. All
the user will need to do is update the MongoDB docs, and the
ES index is updated automagically!

Yes - it exposes all (most?) of Lucene's API via its query DSL, and adds
a few really nice features.
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/

That's great! Although I'm currently using Solr, I really like the distributed
nature of ES (I don't need it now, but it's there if I do) and its
simple REST interface. I'll definitely keep it in mind when I start
my next project.

Jack

We use Hazelcast in front of Elastic Search. All data is put into Hazelcast
first, which has a bridge that writes it to ES. We also write to MySQL
at the same time, until ES is out of beta.

Hazelcast gives us a distributed cache for identity retrieval, while we go
directly to ES for queries. Hazelcast also gives us the transactional
controls that MongoDB and ES lack.
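
A toy Python sketch of the write-through shape described above. The
message doesn't say how the bridge is actually built, so this is only
an assumption about the general pattern: `WriteThroughMap` is a
hypothetical stand-in for the distributed map, and plain dicts stand in
for ES and MySQL.

```python
class WriteThroughMap:
    """Toy stand-in for a cache whose writes are forwarded to other stores."""

    def __init__(self, sinks):
        self._data = {}
        self._sinks = list(sinks)  # callables taking (key, value)

    def put(self, key, value):
        # Write to the cache first, then fan out to every backing store.
        self._data[key] = value
        for sink in self._sinks:
            sink(key, value)

    def get(self, key):
        # Identity lookups are served from the cache only.
        return self._data.get(key)


if __name__ == "__main__":
    es_docs, mysql_rows = {}, {}  # stand-ins for the two backends
    cache = WriteThroughMap([es_docs.__setitem__, mysql_rows.__setitem__])
    cache.put("user:1", {"name": "jim"})
    print(cache.get("user:1"), es_docs, mysql_rows)
```

Lookups by ID go through `get`, while queries would go to ES directly,
which matches the split described in the message.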

-- jim

For everyone wondering why you might still need a NoSQL solution next to
Elasticsearch, here's my two cents.

Elasticsearch is still near real time (NRT), whereas with (most) NoSQL
document stores a newly added document is available immediately (real time).
Another reason is that indexing only the fields you want to search, and
storing only the fields that need to be shown (on overviews, lists, dropdowns,
etcetera), can be a huge query performance booster. A NoSQL solution is
geared more towards looking up objects (or object graphs) by ID as fast as
possible; querying for ranges of objects is limited or crude.

In web applications, using ES for overviews, lists, and dropdowns and NoSQL
for item pages (viewing and editing objects) really gives you the best of
both worlds.
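
A small Python sketch of that split, under stated assumptions: the
helper `search_projection`, the field names, and the two dicts standing
in for the NoSQL store and for ES are all made up for illustration.

```python
def search_projection(full_doc, fields):
    """Keep only the fields the search side needs for queries and lists."""
    return {name: full_doc[name] for name in fields if name in full_doc}


if __name__ == "__main__":
    primary_store = {}  # stand-in for the NoSQL store, keyed by ID
    search_docs = {}    # stand-in for what would be sent to ES

    product = {"id": "p1", "title": "Kettle", "price": 25,
               "description": "long text ...", "audit_log": ["created"]}

    # The full object goes to the primary store for item pages.
    primary_store[product["id"]] = product
    # Only the slim projection goes to the search engine for list views.
    search_docs[product["id"]] = search_projection(product, ["title", "price"])
    print(search_docs["p1"])
```

List pages would then be rendered from the slim search documents alone,
and the full object is fetched by ID only when an item page is opened.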

Side note: jlist9, updating a Solr document should not require a prior
delete either.

Hey,

Excellent discussion, let me add my view on some of the
things referred to here:

There are several reasons you might want to add a NoSQL store or even a
database (shudder ;) ) alongside elasticsearch, but the main one is that they
all have really cool features that elasticsearch does not provide (some of
which it will never provide, due to its "search" nature). Some examples
include the relational model (you hear that, I called this cool ;) ),
cross-operation transactions, or NoSQL-specific features such as CouchDB's
unique way of handling changes.

Other features might eventually end up implemented in elasticsearch, such as
full real time, or "parent child" relationships.

Regarding storing just what you want to search on and display, I agree
as well. Note that no matter how fast your NoSQL store of choice is, it's
always faster to return the data as part of the search request than to
return a list of ids and look them up in a NoSQL store. By the way, if
someone is up for an interesting test, compare elasticsearch key-based
lookup performance against other NoSQL stores; you will be surprised at the
results (and elasticsearch does no caching on key-based lookups, except
for the OS file system cache of course).

Regarding MongoDB: yes, sadly, the only option I see for integrating it
with elasticsearch is to do it at the "application" layer, by applying
the same operations to both MongoDB and elasticsearch. This can be
abstracted away within the MongoDB client driver of choice, which would make
life much simpler.

If MongoDB had post-commit hooks, or a way to get the stream of changes
happening on it, then other integration points would be possible.

Regarding the "update" option: an update is a delete followed by reindexing
that document. This is called "index" in elasticsearch and is actually the
default mode when indexing data. There is also an option to "create" a
document, which does no deletion in advance and so performs better, but if
you issue two "creates" with the same id, then two documents will exist
for it.
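
The difference between the two modes can be shown with a toy model in
Python. `ToyIndex` is hypothetical and says nothing about how
elasticsearch stores data internally; it only mirrors the behaviour
described in this message. Other elasticsearch versions may treat a
"create" on an existing id differently, so check the docs for yours.

```python
class ToyIndex:
    """Toy model of the "index" and "create" write modes."""

    def __init__(self):
        self.docs = []  # list of (doc_id, source) pairs

    def create(self, doc_id, source):
        # No check for an existing copy: cheaper, but duplicates can appear.
        self.docs.append((doc_id, source))

    def index(self, doc_id, source):
        # Delete any existing copy first, then add the new one.
        self.docs = [(i, s) for (i, s) in self.docs if i != doc_id]
        self.docs.append((doc_id, source))

    def count(self, doc_id):
        return sum(1 for (i, _) in self.docs if i == doc_id)


if __name__ == "__main__":
    idx = ToyIndex()
    idx.index("1", {"v": 1})
    idx.index("1", {"v": 2})   # replaces the first copy
    idx.create("2", {"v": 1})
    idx.create("2", {"v": 2})  # adds a second copy with the same id
    print(idx.count("1"), idx.count("2"))
```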

-shay.banon

Hi guys,

I'd like to add my view on the topic too, because I'm involved in both
ElasticSearch and NoSQL development (I'm the author of Terrastore and
its ElasticSearch integration).

I think the most important reason to use ElasticSearch (ES) together
with a NoSQL solution, that is, without storing source documents in ES
indexes, is the performance degradation a Lucene directory
suffers as its size grows.
I experienced this pain personally when working with a clustered
Lucene implementation in the past. ES addresses the problem by
sharding into more directories, which reduces their size, but I know of
people who are still experiencing the same problem, with directory
shards growing and growing because of the stored documents rather than the
indexes themselves. In such a case, adding more shards is obviously
the suggested solution, but it comes with its own costs.

So, I think the best of both worlds would be a two-step solution:

  1. Post-commit hooks in NoSQL stores to automatically send data to
    ES for indexing, as already mentioned in previous mails.
  2. A pluggable API in ES to automatically retrieve the document source
    from the NoSQL store by key: this would still imply several network
    hops, but would at least make the developer's life easier.
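
The two steps can be sketched in Python as follows. Neither class
corresponds to a real API: `HookedStore` and `KeyOnlyIndex` are
hypothetical toys that only show how a post-commit hook and a pluggable
source loader would fit together. The toy index does not handle updates
or deletes.

```python
class HookedStore:
    """Toy key-value store that calls registered hooks after each write."""

    def __init__(self):
        self._data = {}
        self._hooks = []

    def on_commit(self, hook):
        self._hooks.append(hook)

    def put(self, key, document):
        self._data[key] = document
        # Step 1: notify listeners once the write has been applied.
        for hook in self._hooks:
            hook(key, document)

    def get(self, key):
        return self._data[key]


class KeyOnlyIndex:
    """Toy inverted index that keeps only keys and loads sources on demand."""

    def __init__(self, load_source):
        self._postings = {}              # term -> set of keys
        self._load_source = load_source  # step 2: pluggable source lookup

    def index(self, key, document):
        for value in document.values():
            for term in str(value).lower().split():
                self._postings.setdefault(term, set()).add(key)

    def search(self, term):
        keys = sorted(self._postings.get(term.lower(), ()))
        return [self._load_source(key) for key in keys]


if __name__ == "__main__":
    store = HookedStore()
    index = KeyOnlyIndex(load_source=store.get)
    store.on_commit(index.index)
    store.put("k1", {"title": "Lucene in Action"})
    print(index.search("lucene"))
```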

My two euro cents,
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Hi Sergio

I think the most important reason to use Elasticsearch (ES) together
with a NoSQL solution, that is, without storing source documents in ES
indexes, is the performance degradation a Lucene directory
suffers as its size grows.

At what size of Lucene directory did you start experiencing performance
degradation - this could be a useful guide for us ES users as to the
number of shards that we should use for our data.

And are you saying that just storing the doc in ES has a similar
performance impact as index size?

thanks

clint

Hi,

Storing the actual source of course causes more overhead when indexing a
document (basically, when a Lucene segment is written, more data needs to be
written to disk), and when segments are merged (again, more IO operations,
but very trivial ones). This is a very minimal cost compared to all the
other things that happen when indexing a document, so I am surprised by
your experience, Sergio. I think there might have been something else in
play here.

On the other hand, not storing the source document means several things.
First, you need to fetch the relevant data from your "other storage",
possibly a NoSQL store. Now, let's compare the cost of this: when executing a
search, the "fetching" phase is already executing on the relevant node, so
all that is added is accessing the shard storage (let's say fs) and fetching
it.

On the other hand, let's assume you get back just the list of ids, and you
use those to fetch the relevant data. This requires at most N remote calls
(depending on whether the NoSQL store has a batch get, but in any case, those
will probably be translated to hitting different nodes in the cluster). This
means you pay the price of N remote calls. Then, depending on the
implementation details of the NoSQL store in question, it also means fetching
that data from its "persistent" storage. This is an order of magnitude slower
than fetching it from elasticsearch, which is highlighted even more since
most NoSQL solutions don't provide a true async transport API.
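
As rough arithmetic for the call counts in this comparison, here is a
small Python sketch. It is a simplified model, not a benchmark:
`remote_calls` is a hypothetical helper that only counts client-side
round trips and ignores how a batch get may fan out to several nodes
inside the other store.

```python
import math


def remote_calls(hits, source_in_search=True, batch_size=1):
    """Count client round trips needed to display `hits` results.

    The search itself is always one call. When the source is not returned
    with the hits, each batch of ids costs one extra call to the other store.
    """
    if source_in_search:
        return 1
    return 1 + math.ceil(hits / batch_size)


if __name__ == "__main__":
    print(remote_calls(10))                        # source returned with hits
    print(remote_calls(10, False))                 # one lookup per id
    print(remote_calls(10, False, batch_size=10))  # one batched lookup
```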

That's why, by default, elasticsearch does store the _source, and I think
that it's a good "out of the box" solution, and this is what I would
recommend using most (if not almost all) of the time. For cases where
you don't want to, you can always disable it.

-shay.banon

On the other hand, let's assume you get back just the list of ids,
and you use those to fetch the relevant data. This requires at most
N remote calls (depending on whether the NoSQL store has a batch get, but
in any case, those will probably be translated to hitting different nodes
in the cluster). This means you pay the price of N remote calls. Then,
depending on the implementation details of the NoSQL store in question, it
also means fetching that data from its "persistent" storage. This is an
order of magnitude slower than fetching it from elasticsearch, which
is highlighted even more since most NoSQL solutions don't provide
a true async transport API.

The approach I use is:

  • the data representation stored in ES is not the same as the data
    structure in my object

  • if I do a search in ES, I retrieve 300 IDs, without _source, which I
    then cache in memcached - a short expiration time is acceptable to me

  • I then retrieve (eg) the 10 IDs (one page of results) that I need
    for this particular request, and instantiate each object from
    (1) memcached if present (very fast), or
    (2) the DB, which I do in a single query, and then cache
    the objects to memcached

  • when objects are updated, the cached version is also updated, and
    my typical related non-keyword queries are removed from the cache
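
The flow in the list above can be sketched in Python like this. It is a
toy under stated assumptions: `PagedSearch` is a hypothetical name,
plain dicts stand in for memcached, the two callables stand in for ES
and the DB, cache expiration is left out, and an update simply drops
every cached ID list rather than only the related ones.

```python
class PagedSearch:
    """Toy version of the flow: cache hit IDs, then load one page of objects."""

    def __init__(self, search_ids, load_many, page_size=10):
        self.search_ids = search_ids  # query -> list of IDs (stand-in for ES)
        self.load_many = load_many    # list of IDs -> {id: object} (the DB)
        self.page_size = page_size
        self.id_cache = {}            # stand-in for memcached: query -> IDs
        self.object_cache = {}        # stand-in for memcached: id -> object

    def page(self, query, page_number=0):
        # Run the search only if this query's ID list is not cached yet.
        if query not in self.id_cache:
            self.id_cache[query] = self.search_ids(query)
        start = page_number * self.page_size
        ids = self.id_cache[query][start:start + self.page_size]

        # One batched DB query for everything not already cached.
        missing = [i for i in ids if i not in self.object_cache]
        if missing:
            self.object_cache.update(self.load_many(missing))
        return [self.object_cache[i] for i in ids]

    def update(self, obj_id, obj):
        # Keep the cached object fresh and drop cached result lists.
        self.object_cache[obj_id] = obj
        self.id_cache.clear()


if __name__ == "__main__":
    db = {i: {"id": i} for i in range(25)}
    search = PagedSearch(lambda q: list(db), lambda ids: {i: db[i] for i in ids})
    print(search.page("anything", page_number=2))
```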

clint

On Wed, Oct 6, 2010 at 1:54 PM, Clinton Gormley clinton@iannounce.co.uk wrote:

At what size of Lucene directory did you start experiencing performance
degradation - this could be a useful guide for us ES users as to the
number of shards that we should use for our data.

It depends: I was experiencing problems with just one gigabyte of
indexes, mainly small documents but with lots of indexed fields.
Another company I know of is experiencing problems with lots of
documents (several gigabytes) but very few fields each.

Obviously, don't take my word as a reference, and always experiment in
your own environment and use case.

And are you saying that just storing the doc in ES has a similar
performance impact as index size?

I'm just saying that increasing index size decreases write
performance: I can't honestly say by exactly how much.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Hi Shay,

thanks for your response.
My thoughts below ...

Storing the actual source of course causes more overhead when indexing a
document (basically, when a Lucene segment is written, more data needs to be
written to disk), and when segments are merged (again, more IO operations,
but very trivial ones). This is a very minimal cost compared to all the
other things that happen when indexing a document, so I am surprised by
your experience, Sergio. I think there might have been something else in
play here.

There might have been something else. I'm in no way an expert in Lucene
internals: I'm just reporting my experiences.

On the other hand, not storing the source document means several things.
First, you need to fetch the relevant data from your "other storage",
possibly a NoSQL store. Now, let's compare the cost of this: when executing a
search, the "fetching" phase is already executing on the relevant node, so
all that is added is accessing the shard storage (let's say fs) and fetching
it.

Agreed, reading everything from ES will certainly be faster, but I was
talking about write performance degradation.

That's why, by default, elasticsearch does store the _source, and I think
that it's a good "out of the box" solution,

I'm not saying it is not a good "out of the box" solution: it actually
is, so I agree with you there :)

and this is what I would
recommend using most (if not almost all) of the time.

Here's where I disagree, but it's just my opinion.

Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob