I am using MongoDB as my primary DB (we will host everything on AWS, and our database will be sharded), and I want to use Elasticsearch to provide full-text search support. I have various "types" of documents in my database (which are updated and removed frequently), for example "users", "file-metadata", "logs", etc., and I want to provide real-time search support on a variety of fields for all document types (i.e., as soon ("soon" <= 1 sec) as anything in the DB changes, the ES server should be updated as well). I am thinking of the following strategy:
Run a daemon in the background (probably on the MongoDB host itself) which keeps doing this operation every second (or even more often):

1. Fetch a timestamp "T" from the ES server (where to keep it in ES is what I want to ask you guys).
2. Find all MongoDB documents which were modified after "T", and push them to ES (all documents in my case have a last-modified time field, and it is indexed).
3. Update timestamp "T" on the ES server to the most recent value among all documents pushed (if none were pushed, do nothing).

Repeat 1-3 forever (a rough sketch of this loop is below).
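This is roughly what I picture, assuming pymongo and a recent official elasticsearch Python client. The database, collection, index, and field names (mydb, file_metadata, file-metadata, modified, sync-state) are placeholders of mine, and I am assuming "modified" is stored as a numeric (epoch) timestamp; this is only a sketch, not a finished design:

    import time
    from pymongo import MongoClient
    from elasticsearch import Elasticsearch

    mongo = MongoClient()["mydb"]                  # placeholder database name
    es = Elasticsearch()

    def read_t():
        # Step 1: fetch "T" from ES; here it lives as a single document in a
        # tiny "sync-state" index, which is exactly the design decision I ask
        # about further down.
        try:
            return es.get(index="sync-state", id="last_sync")["_source"]["t"]
        except Exception:
            return 0  # first run: sync everything

    def transform(doc):
        # My pre-processing hook: add/remove fields before pushing to ES.
        doc = dict(doc)
        doc.pop("_id", None)  # ES keeps the id outside the source document
        return doc

    def sync_once():
        t = read_t()
        newest = t
        # Step 2: push every document modified after "T"
        # (assumes "modified" is a numeric epoch timestamp).
        for doc in mongo["file_metadata"].find({"modified": {"$gt": t}}):
            es.index(index="file-metadata", id=str(doc["_id"]), body=transform(doc))
            newest = max(newest, doc["modified"])
        # Step 3: advance "T" only if something was pushed.
        if newest > t:
            es.index(index="sync-state", id="last_sync", body={"t": newest})

    while True:   # Repeat forever, roughly once a second.
        sync_once()
        time.sleep(1)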
Note: the above approach does not work for document deletions in the MongoDB database. I was thinking of making a separate collection in MongoDB which keeps track of deleted objects, so that the daemon can read it perpetually too and make the required changes on the ES server.
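Continuing the sketch above (same mongo and es handles), the deletion side could look roughly like this, assuming a hypothetical deleted_objects tombstone collection into which the application writes the document's _id, its target index, and a deleted_at time whenever it removes something:

    def sync_deletes(last_delete_t):
        # Read tombstones recorded after the previous pass and mirror the deletes in ES.
        newest = last_delete_t
        for tomb in mongo["deleted_objects"].find({"deleted_at": {"$gt": last_delete_t}}):
            try:
                es.delete(index=tomb["index"], id=str(tomb["_id"]))
            except Exception:
                pass  # the document may never have reached ES, so nothing to delete
            newest = max(newest, tomb["deleted_at"])
        return newest  # where to persist this watermark is the same open question as "T"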
The question I want to ask is: where should I store the value "T" on the ES server? As you can see from the above design, "T" needs to be fetched and updated very often (possibly every second). Should I keep "timestamp" as a special document in the same index/type (where I keep my data), fetch it using a special query, and update it like a normal document at the end of each push (will a change in the value of T cause re-indexing every time?). Or is there some metadata which I can store with a particular "index" and update/fetch frequently (without needing to index it)? Please note that the value "T" is different from the time ES was last updated ("T" is actually the last-modified timestamp of the last MongoDB document updated/added in ES).
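The "special document" option is essentially what the loop sketch above does (a document with a well-known id, overwritten at the end of each push). The per-index metadata idea I have in mind is the mapping's _meta section, roughly as below; this is only an assumption on my part (both the API shape and whether hitting the mapping this often is reasonable):

    # Assumption: the mapping's _meta section can hold arbitrary key/values and
    # is not indexed as a document; updating it is a mapping change, not a write.
    def write_t_meta(t):
        es.indices.put_mapping(index="file-metadata", body={"_meta": {"last_sync_t": t}})

    def read_t_meta():
        mapping = es.indices.get_mapping(index="file-metadata")
        return mapping["file-metadata"]["mappings"]["_meta"]["last_sync_t"]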
Can someone see any obvious problem with this solution? Is there a better way to do this (a river? Please note that my DB is sharded, and I also need to pre-process documents (e.g., add/remove some fields) before pushing them to ES). All suggestions/comments are welcome.
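For completeness, by pre-processing I mean reshaping each MongoDB document before it is indexed; if the one-by-one es.index calls in the sketch above turn out to be too slow, I imagine batching the transformed documents through the client's bulk helper, something like this (index name still a placeholder):

    from elasticsearch import helpers

    def push_batch(docs):
        # Pre-process each MongoDB document, then send a single bulk request to ES.
        actions = [
            {
                "_index": "file-metadata",      # placeholder index name
                "_id": str(doc["_id"]),
                "_source": transform(doc),      # same pre-processing hook as above
            }
            for doc in docs
        ]
        helpers.bulk(es, actions)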