Elasticsearch & mongo - sync


(John Merrells) #1

Hello,

I have a bunch of documents stored in mongo that I'm indexing with elasticsearch.
Would there happen to be a handy sync script/tool to keep elasticsearch up to
date wrt a mongo collection...? Has anyone on the list been down this road before
and care to share some wisdom?

John


(Shay Banon) #2

I know that some people have expressed interest in this. I don't have much
mongo experience (actually, none...), but some options are to either
hook into a "post commit" hook in mongo, or query it in some way for the latest
changes. Another option, which is more on the application side, is to apply
the same changes done to mongo to elasticsearch as well. I'm sure you've thought
of these already ...

-shay.banon


(John Merrells) #3

On Aug 3, 2010, at 10:55 PM, Shay Banon wrote:

I know that some expressed interest in it. I don't have much mongo experience (actually, none...), but some suggestions can be either hook into a "post commit" hook in mongo, or query it in some way for latest changes? Another option, which is more on the application side, is to apply the same changes done to mongo to elasticsearch as well. Sure you thought of these already ...

My current plan is to have an updated_at datetime field on each document.
With that I can ask elasticsearch for the max datetime it has and then ask
mongo for all the documents with a datetime greater than that. The only
twisty bit so far is that the datetime is going to need to be stored as a string
on mongo, and a long on elasticsearch, but in theory it should work.
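
The plan above could be sketched roughly as follows. This is a Python illustration only: `mongo_docs` and `index_docs` are hypothetical in-memory stand-ins for the mongo collection and the elasticsearch index, `updated_at` is assumed to be a plain number, and no real driver calls are shown.

```python
def max_updated_at(index_docs):
    """Return the newest updated_at held by the index, or None if it is empty."""
    return max((doc["updated_at"] for doc in index_docs.values()), default=None)


def sync_newer_documents(mongo_docs, index_docs):
    """Copy every document changed after the index's high-water mark."""
    watermark = max_updated_at(index_docs)
    # Strictly greater: documents sharing the watermark value are skipped.
    changed = [
        doc for doc in mongo_docs
        if watermark is None or doc["updated_at"] > watermark
    ]
    for doc in changed:
        index_docs[doc["_id"]] = dict(doc)
    return len(changed)


# Hypothetical data: the index is missing "c" and holds a stale copy of "b".
mongo_docs = [
    {"_id": "a", "updated_at": 1000, "title": "first"},
    {"_id": "b", "updated_at": 2000, "title": "second"},
    {"_id": "c", "updated_at": 3000, "title": "third"},
]
index_docs = {
    "a": {"_id": "a", "updated_at": 1000, "title": "first"},
    "b": {"_id": "b", "updated_at": 1500, "title": "second (stale)"},
}

copied = sync_newer_documents(mongo_docs, index_docs)
```

Running the sync a second time copies nothing, because the watermark has moved up to the newest document.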

John


(Shay Banon) #4

Sounds good. There is no long type in mongo?


(John Merrells) #5

On Aug 4, 2010, at 6:31 AM, Shay Banon wrote:

Sounds good. There is no long type in mongo?

Doh. Yes, there is. That'd be simpler.

I got there because I started looking at the Date type, but on mongo
it's hard to query it through the protocol, and on elasticsearch there's
no max for date... so then I tried string.... then long.... so yeah, I should
just use long on both....

John


(John Merrells) #6

On Aug 4, 2010, at 6:31 AM, Shay Banon wrote:

Sounds good. There is no long type in mongo?

Sorry to bore everyone with my early morning pre-coffee mumblings,
but 'float' on both mongo and elastic is most probably the way to go.

John


(Shay Banon) #7

Why float? I assumed a long, representing it as a timestamp. The date type in
elasticsearch is just a facade on top of a long.
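
As a rough illustration of the long-as-timestamp idea, here is a small Python sketch. It assumes millisecond resolution and UTC datetimes; the helper names `to_epoch_millis` and `from_epoch_millis` are made up for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment):
    """Convert an aware datetime to whole milliseconds since the Unix epoch."""
    # Integer division of timedeltas avoids float rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis):
    """Convert milliseconds since the Unix epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# Two updates inside the same second still get distinct long values.
first = datetime(2010, 8, 4, 13, 49, 0, 250000, tzinfo=timezone.utc)
second = datetime(2010, 8, 4, 13, 49, 0, 750000, tzinfo=timezone.utc)
first_ms = to_epoch_millis(first)
second_ms = to_epoch_millis(second)
```

An integer count of milliseconds keeps sub-second ordering without a float, though two updates inside the same millisecond would still collide.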

-shay.banon


(John Merrells) #8

On Aug 4, 2010, at 6:49 AM, Shay Banon wrote:

Why float? I assumed long and represent it as a timestamp. Date type in elasticsearch is just a facade on top of long.

There could be many updates in a second... In Ruby Time.now.to_f returns a Float....

I've just realized that mongo can auto generate ids, which are a bit like guids, so they have a
seconds field and a counter field within them.... soo... I could extract those bits and use them,
but it'd amount to much the same thing, and be a bit opaque.
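
For reference, extracting those bits might look like the Python sketch below. It assumes the usual 12-byte ObjectId layout (a 4-byte big-endian seconds value first, a 3-byte counter last); the ids shown are invented for illustration and `split_object_id` is a hypothetical helper, not a driver function.

```python
def split_object_id(hex_id):
    """Return (seconds, counter) from a 24-character hex ObjectId string."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[0:4], "big")   # creation time, in seconds
    counter = int.from_bytes(raw[9:12], "big")  # per-process counter
    return seconds, counter


# Invented ids created within the same second by the same process.
earlier = "4c5a1f3e0000000000000001"
later = "4c5a1f3e0000000000000002"

earlier_key = split_object_id(earlier)
later_key = split_object_id(later)
```

The counter is kept per generating process, so the (seconds, counter) pair only orders ids reliably when a single process creates them. It also records creation time, not the time of the last update.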

Still pre coffee.... still rambling....

John


(Shay Banon) #9

Auto increment GUIDs are the best, but with timestamps, as you suggested,
there might be several within the same resolution. One way to work around
that is to create the query so that you subtract the resolution you get
on your machine (1 milli for example), use "index" on whatever falls within
it, and use "create" on the rest (assuming you know that they don't exist in
elasticsearch).
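
A rough Python sketch of that overlap-window idea, under stated assumptions: `updated_at` is a millisecond long, `watermark` is the max value already in elasticsearch, and `plan_bulk_actions` is a hypothetical helper that only picks the operation name for each document rather than calling elasticsearch.

```python
def plan_bulk_actions(changed_docs, watermark, resolution=1):
    """Pick 'index' or 'create' for each doc at or after watermark - resolution."""
    window_start = watermark - resolution
    actions = []
    for doc in changed_docs:
        stamp = doc["updated_at"]
        if stamp < window_start:
            continue  # older than the overlap window: already synced
        # Inside the window the doc may already be indexed, so overwrite it.
        # Past the watermark it is assumed to be new, so create it.
        operation = "index" if stamp <= watermark else "create"
        actions.append((operation, doc["_id"]))
    return actions


# Hypothetical documents around a watermark of 1000 ms.
docs = [
    {"_id": "a", "updated_at": 998},
    {"_id": "b", "updated_at": 999},
    {"_id": "c", "updated_at": 1000},
    {"_id": "d", "updated_at": 1001},
]
actions = plan_bulk_actions(docs, watermark=1000, resolution=1)
```

The "create" branch leans on the assumption stated above, that anything newer than the watermark does not exist in elasticsearch yet.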

By the way, the most difficult part when it comes to syncs is handling
deletes. I assume you don't have them?

-shay.banon


(John Merrells) #10

On Aug 4, 2010, at 7:03 AM, Shay Banon wrote:

By the way, the most difficult part when it comes to sync's is to handle deletes, I assume you don't have them?

No deletes.

John

