MongoDB & ES: Where can I store the timestamp of the last updated document in ES?


(bcoder) #1

Let me describe my scenario:

I am using MongoDB as my primary database (we will host everything on
AWS, and the database will be sharded), and I want to use
Elasticsearch to provide full-text search. I have various "types" of
documents in my database (which are updated and removed frequently),
for example "users", "file-metadata", "log", etc., and I want to
provide real-time search on a variety of fields for all document
types (i.e., as soon as anything in the DB changes, where "soon"
means <= 1 sec, the ES server should be updated as well). I am
thinking of the following strategy:

Run a daemon in the background (probably on the MongoDB host itself),
which repeats this operation every second (or even more often):

  1. Fetch a timestamp "T" from the ES server (where to keep it in ES
    is what I want to ask you).
  2. Find all MongoDB documents that were modified after "T" and push
    them to ES (all documents in my case have a modified-time field,
    and it's indexed).
  3. Update timestamp "T" on the ES server to the most recent value
    among all documents pushed (if none was pushed, do nothing).
  4. Repeat 1-3 forever.
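
The loop in steps 1-4 can be sketched roughly as follows. This is only
an in-memory sketch under assumptions: `load_checkpoint`,
`find_modified_after`, `push_to_es`, and `save_checkpoint` are
hypothetical callables standing in for the real MongoDB and ES calls,
and `modified` is an assumed name for the modified-time field.

```python
import time
from typing import Callable, Iterable


def sync_once(
    load_checkpoint: Callable[[], float],
    find_modified_after: Callable[[float], Iterable[dict]],
    push_to_es: Callable[[list], None],
    save_checkpoint: Callable[[float], None],
) -> int:
    """Run steps 1-3 once and return how many documents were pushed."""
    t = load_checkpoint()                  # step 1: fetch "T"
    docs = list(find_modified_after(t))    # step 2: documents modified after "T"
    if not docs:
        return 0                           # step 3: nothing pushed, leave "T" alone
    push_to_es(docs)
    # step 3: move "T" to the newest modified time among the pushed documents
    save_checkpoint(max(d["modified"] for d in docs))
    return len(docs)


def run_forever(step: Callable[[], int], interval_sec: float = 1.0) -> None:
    """Step 4: repeat steps 1-3 roughly once per interval."""
    while True:
        step()
        time.sleep(interval_sec)
```

Keeping the four operations as parameters means the same loop can be
exercised against in-memory stand-ins before it is wired to real
servers.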

Note: The above approach does not work for document deletions in the
MongoDB database. I was thinking of making a separate collection in
MongoDB that keeps track of deleted objects, so that the daemon can
also read it continuously and make the required changes on the ES
server.
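
A rough sketch of that deletion-tracking idea might look like this.
Here the `deleted_log` list stands in for the separate MongoDB
collection, `es_index` is a plain dict standing in for the ES index,
and the record shape (a `doc_id` key) is an assumption made only for
illustration.

```python
def apply_deletions(deleted_log: list, es_index: dict) -> int:
    """Remove each logged document from the index and drain the log."""
    handled = 0
    while deleted_log:
        record = deleted_log.pop(0)           # oldest deletion first
        es_index.pop(record["doc_id"], None)  # no error if it was never indexed
        handled += 1
    return handled
```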

My question is where I should store the value "T" on the ES server.
As you can see from the design above, "T" needs to be fetched and
updated very often (possibly every second). Should I keep the
timestamp as a special document in the same index/type where I keep
my data, fetch it with a special query, and update it like a normal
document at the end of each push (will each change in the value of
"T" cause re-indexing)? Or is there some metadata that I can store
with a particular index and fetch/update frequently, without it
needing to be indexed? Please note that "T" is not the time when ES
was last updated; it is the last-modified timestamp of the most
recent MongoDB document added to or updated in ES.
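
To make the "special document" option concrete, here is a minimal
sketch of keeping "T" under one fixed ID, so that it is read by ID
rather than by a search query. The `store` dict stands in for ES, and
the ID `sync-checkpoint` and field name `last_modified` are made-up
names for illustration, not anything ES defines.

```python
CHECKPOINT_ID = "sync-checkpoint"  # hypothetical fixed ID for the special document


def load_checkpoint(store: dict, default: float = 0.0) -> float:
    """Read "T" by ID; fall back to a default before the first sync."""
    doc = store.get(CHECKPOINT_ID)
    return doc["last_modified"] if doc else default


def save_checkpoint(store: dict, t: float) -> None:
    """Overwrite the single checkpoint document with the new "T"."""
    store[CHECKPOINT_ID] = {"last_modified": t}
```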

Can anyone see an obvious problem with this solution? Is there a
better way to do this (a river? Please note that my DB is sharded,
and that I also need to pre-process documents, e.g. add/remove some
fields, before pushing them to ES)? All suggestions/comments are
welcome.
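
A pre-processing step of that kind could look roughly like this. The
field names (`password_hash`, `internal_notes`, `doc_type`) are
invented purely to illustrate removing and adding fields before the
push.

```python
DROP_FIELDS = {"password_hash", "internal_notes"}  # hypothetical fields to strip


def preprocess(doc: dict, doc_type: str) -> dict:
    """Return a copy of the document with some fields removed and one added."""
    cleaned = {k: v for k, v in doc.items() if k not in DROP_FIELDS}
    cleaned["doc_type"] = doc_type  # added field, e.g. "users" or "log"
    return cleaned
```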

Thanks!


(Radu Gheorghe) #2

Hi,

I would use a river:

And I would try to contribute to it if it didn't provide what I
need.

On Apr 20, 1:41 am, bcoder blitzkriegco...@gmail.com wrote:


