MongoDB + SOLR integration


(shreyas) #1

Guys,
People might have already asked questions about ES + MongoDB
integration.

But Google didn't return any results for "mongodb integration" :stuck_out_tongue:

So, has anyone successfully integrated mongodb with ES?

Is SOLR better to integrate with MongoDB? Any recommendations?
Thanks,
Shreyas


(jjasinek) #2

I'm not sure we can say that one is better than the other without
understanding what your goals are. However, knowing MongoDB as well,
I would imagine you chose it for its ability to store schema-free
documents and its support for shards and replica sets because your
dataset is growing. If so, then you probably want ElasticSearch as
well, for those same features plus the added capability of NRT
(near-real-time) search.

In that case you might want to check out http://www.matt-reid.co.uk/blog_post.php?id=68#&slider1=4
and https://github.com/aparo/elasticsearch/tree/master/plugins/river/mongodb
for an ElasticSearch river that reads the MongoDB oplog.
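
To make the oplog-reading idea concrete, here is a minimal sketch (not the code of the river linked above) of turning MongoDB oplog entries into Elasticsearch bulk-API lines. The function name `oplog_to_bulk_lines`, the `fetch_current` callback and the index name are hypothetical; the entry fields `op`, `o` and `o2` follow the usual oplog layout. It assumes ids and field values are JSON-serializable, and older Elasticsearch versions would also expect a `_type` in each action line.

```python
import json


def oplog_to_bulk_lines(entries, index_name, fetch_current=None):
    """Translate oplog entries into Elasticsearch bulk-API lines (sketch).

    fetch_current is a hypothetical callback that re-reads a document
    from MongoDB by _id; it is needed for modifier updates.
    """
    lines = []

    def index_action(doc):
        # The _id goes into the action line, the rest into the source line.
        body = {k: v for k, v in doc.items() if k != "_id"}
        action = {"index": {"_index": index_name, "_id": str(doc["_id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(body))

    for entry in entries:
        op = entry.get("op")
        if op == "i":
            # Insert: 'o' holds the full document.
            index_action(entry["o"])
        elif op == "u":
            change = entry["o"]
            doc_id = entry["o2"]["_id"]  # 'o2' holds the selector
            if any(key.startswith("$") for key in change):
                # Modifier update ($set, $inc, ...): re-read the document.
                current = fetch_current(doc_id) if fetch_current else None
                if current is not None:
                    index_action(current)
            else:
                # Full replacement: 'o' is the new document.
                index_action({**change, "_id": doc_id})
        elif op == "d":
            # Delete: 'o' holds only the _id.
            action = {"delete": {"_index": index_name, "_id": str(entry["o"]["_id"])}}
            lines.append(json.dumps(action))
        # No-ops ("n") and commands ("c") are skipped in this sketch.
    return lines
```

The returned lines would be joined with newlines (plus a trailing newline) to form a bulk request body.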


(Marc Seeger-2) #3

As far as Solr goes, there is also
this: https://github.com/mikejs/photovoltaic
But the river sounds nicer.


(Alex Piggott) #4

Hi, our project uses MongoDB as a persistent data store (with rapidly
changing fields in documents), and elasticsearch as an index for the
invariant/slowly-changing fields in front.

We have about 100M documents (10M large, 100M small; average 100
fields/large-doc, 10 fields/small-doc) in a 6-node pseudo-operational
cluster (3 elasticsearch and 3 MongoDB). We are expecting our
operational deployments to be in the 10-20 node range.

We moved to elasticsearch from SOLR. elasticsearch was far easier to
integrate because both MongoDB and elasticsearch "naturally" use JSON.
(Actually our initial reason for migrating was geo-spatial
functionality; the fact that we could retire lots of code and have
fewer problems when schemas changed was a nice secondary benefit!)

A few aspects of our integration:

  • We don't run MongoDB and elasticsearch on the same "physical" (/
    logical) nodes because they are both pretty memory and disk-bandwidth
    hungry (MongoDB more so than elasticsearch, we hardly run anything
    else on our MongoDB node). The two instances should be connected by a
    fast LAN though.

  • We don't use a river to synchronize them - we control all
    insertions/deletions/modifications into the data store, therefore we
    can "mirror" the objects at that point (this is more efficient and
    also allows us to transform the objects to take advantage of
    elasticsearch-specific features in eg geo).

(That said, we had to write a custom ORM to support this maintainably,
so that was a downside - for smaller prototype projects this shouldn't
be necessary however: just convert the object to both JSON and BSON -
eg using "gson", insert one into elasticsearch and one into mongodb)

  • We retrieve the documents from MongoDB based on the (common) "_ids"
    returned from elasticsearch. This part gets a "C+" at best -
    elasticsearch is really fast, the MongoDB "$in" query is really fast,
    but returning all the (large) documents from MongoDB is a bit slow
    (1000 can take about 1.5s, dominated by network IO). I think MongoDB
    have scope to speed up their network IO but it's acceptable for the
    moment.

(If you need to perform analytics on very large numbers of documents
defined by a search, rather than "just" return the results of
searches, this method of integration may not be suitable. We're
investigating tighter coupling between the 2 platforms for this
purpose at the moment. FWIW I asked one of the 10gen lead engineers if
they had any tricks up their sleeves and he couldn't think of anything
on the spot.)

  • We have a separate process for monitoring the synchronization
    between the 2, somewhat similar to the scrutineer someone just posted
    (I had a quick look at the code, and it looked like it would be very
    easy to write a MongoDB driver to go along with the existing JDBC
    ones).

  • Not really an integration issue, but elasticsearch is far easier
    to distribute across multiple nodes than MongoDB!
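
The "mirror at write time" approach from the list above can be sketched as follows. This is only an illustration under stated assumptions: plain dicts stand in for the MongoDB collection and the elasticsearch index, and `MirroredStore`, `to_index_doc` and the field names (`title`, `lat`, `lon`, `views`) are hypothetical. In real code the two dict writes would be driver calls.

```python
class MirroredStore:
    """Write-through layer that keeps a search index in step with a data store."""

    def __init__(self, primary, index, to_index_doc):
        self.primary = primary            # stand-in for a MongoDB collection
        self.index = index                # stand-in for an elasticsearch index
        self.to_index_doc = to_index_doc  # transform applied before indexing

    def save(self, doc):
        doc_id = doc["_id"]
        # Full document goes to the data store ...
        self.primary[doc_id] = dict(doc)
        # ... and a transformed copy goes to the index under the same id.
        self.index[doc_id] = self.to_index_doc(doc)

    def delete(self, doc_id):
        self.primary.pop(doc_id, None)
        self.index.pop(doc_id, None)


def to_index_doc(doc):
    """Keep only slowly-changing fields and reshape the coordinates."""
    indexed = {"title": doc["title"]}
    if "lat" in doc and "lon" in doc:
        # An object with "lat"/"lon" keys is one of the geo_point input formats.
        indexed["location"] = {"lat": doc["lat"], "lon": doc["lon"]}
    return indexed


store = MirroredStore(primary={}, index={}, to_index_doc=to_index_doc)
store.save({"_id": "d1", "title": "Report", "lat": 51.5, "lon": -0.1, "views": 7})
```

Because both copies share the same `_id`, search hits can later be resolved against the data store directly.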

I can't think of anything else off the top of my head.

So in summary, elasticsearch integrates with MongoDB much better than
SOLR (as well as being better for our application in many other ways).
It's easy to get up-and-running, though there are a few issues for
bigger/more complex code.
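
The search-then-fetch step described above (ids from elasticsearch, documents from MongoDB via "$in") can be sketched like this. A "$in" query does not return documents in the order of the id list, so the sketch restores the search ranking afterwards. `fetch_in_rank_order`, `make_fake_find` and `batch_size` are hypothetical names; `find` stands in for a real collection query, and batching is an illustrative knob rather than something the post prescribes.

```python
def fetch_in_rank_order(ranked_ids, find, batch_size=500):
    """Fetch documents by id and return them in search-rank order."""
    by_id = {}
    for start in range(0, len(ranked_ids), batch_size):
        batch = ranked_ids[start:start + batch_size]
        # One "$in" query per batch of ids.
        for doc in find({"_id": {"$in": batch}}):
            by_id[doc["_id"]] = doc
    # Ids missing from the data store are silently dropped.
    return [by_id[i] for i in ranked_ids if i in by_id]


def make_fake_find(collection):
    """In-memory stand-in for a collection's find(); returns storage order."""
    def find(query):
        wanted = set(query["_id"]["$in"])
        return [doc for doc in collection if doc["_id"] in wanted]
    return find


collection = [{"_id": "a"}, {"_id": "b"}, {"_id": "c"}]
docs = fetch_in_rank_order(["c", "x", "a"], make_fake_find(collection), batch_size=2)
```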


(Alex Piggott) #5

Oh forgot one other important integration point:

Nested arrays of objects inside documents are common in
JSON/MongoDB (eg a list of "geo" objects consisting of place names,
country of origin, lat/longs, etc), but are really badly (*) supported
by SOLR (or at least were back when we still used it!).

(*) (The main limitation was boolean searching within objects, eg
"placename='London' AND country='Canada'" would return parent
documents that contained London in one array element and Canada in
another.)
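
The limitation in (*) can be reproduced with a small model. This is not Solr or elasticsearch code, only a sketch of what happens when an array of objects is flattened into multi-valued fields; the function names and the sample document are hypothetical.

```python
doc = {
    "_id": 1,
    "geo": [
        {"placename": "London", "country": "UK"},
        {"placename": "Toronto", "country": "Canada"},
    ],
}


def flatten(doc, field):
    """Collapse an array of objects into multi-valued fields, losing the pairing."""
    flat = {}
    for element in doc[field]:
        for key, value in element.items():
            flat.setdefault(f"{field}.{key}", set()).add(value)
    return flat


def matches_flattened(doc, field, criteria):
    """AND query against flattened fields: each term may hit a different element."""
    flat = flatten(doc, field)
    return all(value in flat.get(f"{field}.{key}", set())
               for key, value in criteria.items())


def matches_per_element(doc, field, criteria):
    """AND query that must be satisfied inside a single array element."""
    return any(all(element.get(key) == value for key, value in criteria.items())
               for element in doc[field])


query = {"placename": "London", "country": "Canada"}
false_positive = matches_flattened(doc, "geo", query)   # True: wrong parent match
correct = matches_per_element(doc, "geo", query)        # False
```

Parent/child documents (or per-element indexing in general) give the second behaviour, at the cost of more bookkeeping on the indexing side.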

In elasticsearch 0.16.x, we used the parent/child infrastructure to
enable correct boolean searching and it worked reasonably well
functionally, though "worst case" searches could get a bit slow (this
was another reason we used our own mirror code vs rivers - it's a bit
fiddly to handle in code but not too terrible).

Later versions of elasticsearch apparently have a more natural way
of embedding child objects into documents, which should make things
even simpler and faster (we're currently migrating to 0.17.x so I
haven't played with that yet).


(shreyas) #6

Wow. Thanks for the detailed reply. I appreciate you guys taking the
time to help me out.

My preferred way is to have a river or something similar.

In the 2nd case I have to add a layer for updates which would write to
both Mongo and ES.

Then I'd periodically run checks to see if both Mongo and ES are in sync.

regards,
Shreyas
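
A periodic sync check of the kind mentioned above could look roughly like the sketch below. Dicts keyed by `_id` stand in for the two stores, and `find_drift` and `to_index_doc` are hypothetical names; a real check over 100M documents would stream ids in sorted batches rather than load everything into memory.

```python
def find_drift(primary, index, to_index_doc=lambda doc: doc):
    """Compare a data store with its search index and report differences.

    to_index_doc is the same transform the write path applies before
    indexing, so transformed documents can be compared like for like.
    """
    primary_ids, index_ids = set(primary), set(index)
    return {
        # In the data store but never indexed (or lost from the index).
        "missing_in_index": sorted(primary_ids - index_ids),
        # Still indexed although deleted from the data store.
        "orphaned_in_index": sorted(index_ids - primary_ids),
        # Present on both sides but with different indexed content.
        "mismatched": sorted(
            doc_id for doc_id in primary_ids & index_ids
            if to_index_doc(primary[doc_id]) != index[doc_id]
        ),
    }


mongo_side = {"a": {"title": "A"}, "b": {"title": "B"}, "c": {"title": "C"}}
es_side = {"a": {"title": "A"}, "b": {"title": "stale"}, "z": {"title": "Z"}}
report = find_drift(mongo_side, es_side)
```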


(Timo Mika Gläßer) #7

Hey Alex, what types of nodes are you running for your setup? Specs?
I'm curious because we're in the planning / rollout phase for our product.

Kind regards
Timo

