I have a bunch of documents stored in mongo that I'm indexing with elasticsearch.
Would there happen to be a handy sync script/tool to keep elasticsearch up to
date wrt a mongo container...? Anyone on the list been down this road before and
care to share some wisdom?
I know that some people have expressed interest in it. I don't have much
mongo experience (actually, none...), but two suggestions: either hook into
a "post commit" hook in mongo, or query it in some way for the latest
changes. Another option, which is more on the application side, is to apply
the same changes done to mongo to elasticsearch as well. I'm sure you thought
of these already...
-shay.banon
On Tue, Aug 3, 2010 at 10:01 PM, John Merrells merrells@gmail.com wrote:
My current plan is to have an updated_at datetime field on each document.
With that I can ask elasticsearch for the max datetime it has and then ask
mongo for all the documents with a datetime greater than that. The only
twisty bit so far is that the datetime is going to need to be stored as a string
on mongo, and a long on elasticsearch, but in theory it should work.
I got there because I started looking at the Date type, but on mongo
it's hard to query it through the protocol, and on elasticsearch there's
no max for date... so then I tried string... then long... so yeah, I should
just use long on both.
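A minimal sketch of the updated_at plan, assuming an integer (long) timestamp on both sides. The arrays and hashes below are in-memory stand-ins, not real mongo or elasticsearch client calls, and the names high_water_mark and sync are made up for illustration.

```ruby
# In-memory stand-ins for a mongo collection and an elasticsearch index
# (hypothetical; real code would go through the respective client libraries).
mongo_docs = [
  { "_id" => "a", "title" => "first",  "updated_at" => 1_280_872_860_000 },
  { "_id" => "b", "title" => "second", "updated_at" => 1_280_872_861_500 },
  { "_id" => "c", "title" => "third",  "updated_at" => 1_280_872_863_250 }
]
es_index = {
  "a" => { "title" => "first", "updated_at" => 1_280_872_860_000 }
}

# Highest updated_at already indexed, or 0 when the index is empty.
def high_water_mark(index)
  index.values.map { |doc| doc["updated_at"] }.max || 0
end

# Copy every document newer than the mark, oldest first.
# Returns the ids that were synced.
def sync(docs, index)
  mark = high_water_mark(index)
  newer = docs.select { |doc| doc["updated_at"] > mark }
              .sort_by { |doc| doc["updated_at"] }
  newer.each do |doc|
    index[doc["_id"]] = doc.reject { |key, _| key == "_id" }
  end
  newer.map { |doc| doc["_id"] }
end

puts sync(mongo_docs, es_index).inspect
```

Running sync a second time finds nothing newer than the mark, so it copies nothing. The strict "greater than" comparison is also where documents sharing the mark's exact timestamp could be missed.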
Why float? I assumed a long, representing it as a timestamp. The date type in elasticsearch is just a facade on top of a long.
There could be many updates in a second... In Ruby Time.now.to_f returns a Float....
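One way to reconcile the two positions, as a sketch only: keep the value a long, but count milliseconds or microseconds rather than whole seconds. The helper names epoch_micros and epoch_millis are made up for this example; Time#to_i and Time#usec are standard Ruby.

```ruby
# Integer microseconds since the Unix epoch for a given Time.
# Built from to_i and usec so no floating point is involved.
def epoch_micros(time)
  time.to_i * 1_000_000 + time.usec
end

# Integer milliseconds since the Unix epoch (sub-millisecond part truncated).
def epoch_millis(time)
  epoch_micros(time) / 1000
end

puts epoch_millis(Time.now)
```

Two updates can still land in the same millisecond, so this narrows the collision window rather than removing it.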
I've just realized that mongo can auto-generate ids, which are a bit like GUIDs and have
a seconds field and a counter field within them... so I could extract those bits and use them,
but it'd amount to much the same thing, and be a bit opaque.
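For reference, a sketch of what extracting those bits could look like, working on the 24-character hex form of an ObjectId: the first 4 bytes are seconds since the epoch and the last 3 bytes are the counter. The function names are made up here, and the combined sort key is only an illustration. An ObjectId records when a document was created, not when it was last updated, and the counter is kept per process, so this would not order writes from different processes exactly.

```ruby
# Split a 24-character hex ObjectId into its creation time and counter.
def object_id_parts(hex)
  raise ArgumentError, "expected 24 hex characters" unless hex =~ /\A\h{24}\z/
  { seconds: hex[0, 8].to_i(16), counter: hex[18, 6].to_i(16) }
end

# Single integer that sorts by seconds first, then by the 24-bit counter.
def object_id_sort_key(hex)
  parts = object_id_parts(hex)
  (parts[:seconds] << 24) | parts[:counter]
end

p object_id_parts("4c585d1c0000000000000001")
```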
Auto-increment GUIDs are the best, but with timestamps, as you suggested,
there might be several within the same resolution. One way to work around
that is to create the query so that you subtract the resolution you get
on your machine (1 milli for example) from the max timestamp, use "index" on
whatever falls within that window, and use "create" on the rest (assuming you
know that they don't exist in elasticsearch).
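A rough sketch of that workaround, assuming integer millisecond timestamps and in-memory documents. The names plan_operations and RESOLUTION_MS are made up for illustration; "index" and "create" are the elasticsearch operation types mentioned above. Anything at or before the current max may already be indexed, so it gets "index"; anything newer is assumed to be missing and gets "create".

```ruby
# Clock resolution to subtract from the high-water mark (an assumed value).
RESOLUTION_MS = 1

# Decide, per document, whether to "index" (may already exist) or "create"
# (assumed new). mark is the max updated_at currently in elasticsearch.
def plan_operations(docs, mark, resolution = RESOLUTION_MS)
  window_start = mark - resolution
  docs.select { |doc| doc["updated_at"] >= window_start }
      .sort_by { |doc| doc["updated_at"] }
      .map do |doc|
        op = doc["updated_at"] <= mark ? "index" : "create"
        { op: op, id: doc["_id"], updated_at: doc["updated_at"] }
      end
end

docs = [
  { "_id" => "a", "updated_at" => 98 },
  { "_id" => "b", "updated_at" => 99 },
  { "_id" => "c", "updated_at" => 100 },
  { "_id" => "d", "updated_at" => 101 }
]
plan_operations(docs, 100).each { |step| puts step.inspect }
```

The "create" branch only holds if documents newer than the mark are inserts. A document that was already indexed and later updated would need "index" instead.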
By the way, the most difficult part when it comes to syncs is handling
deletes. I assume you don't have them?