Franz, I think you really just need to do something similar to what
DIH (DataImportHandler) in Solr does, which is roughly this:
- Have a tool that can do a bulk import/indexing
- Have a tool (perhaps it's the same tool) that can do incremental
import/indexing. It can do that by keeping track of the last imported
ID or by tracking the timestamp of the last import or some such.
The bulk import is something you'd do once, at the beginning, when
your ES index is empty. Then you would periodically run the tool to
do the incremental indexing. You would also use the bulk importer/indexer
if you have to reindex from scratch for whatever reason.
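The incremental half of that could be sketched roughly like this (a minimal sketch; the Record class and the indexDocument() call are hypothetical stand-ins for your real data source and ES client, and in practice the watermark would be persisted between runs):

```java
import java.util.ArrayList;
import java.util.List;

// Incremental import by watermark: remember the last imported ID and
// only index records above it on each run.
public class IncrementalImporter {

    static class Record {
        final long id;
        final String body;
        Record(long id, String body) { this.id = id; this.body = body; }
    }

    private long lastImportedId = 0;          // persist this between runs in practice
    final List<Record> indexed = new ArrayList<>();

    // One incremental run: index everything newer than the watermark.
    public int run(List<Record> source) {
        int count = 0;
        for (Record r : source) {
            if (r.id > lastImportedId) {
                indexDocument(r);
                lastImportedId = Math.max(lastImportedId, r.id);
                count++;
            }
        }
        return count;
    }

    private void indexDocument(Record r) {
        indexed.add(r);                        // stand-in for a real ES index call
    }
}
```

A timestamp column works the same way if your IDs are not monotonically increasing.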
The above will not get your data into ES in real-time, so if you need
real-time search, you need a real-time approach instead of the
incremental one. In that case you need a mechanism that inserts
individual records/documents into ES as soon as they are added
to your data store. Some data stores have hooks to make this kind of
thing automatic, like MongoDB, Terrastore, etc. Hibernate Search does this
for relational databases and Lucene (the low-level Java
library ES uses for indexing/searching).
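The real-time shape is basically a listener on the data store's insert path (a sketch under assumptions: DataStore and the in-memory index are hypothetical; a real listener would call your ES client instead):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Real-time approach: the data store notifies listeners on every
// insert, and a listener pushes the document straight to the index.
public class RealTimeIndexing {

    static class DataStore {
        private final List<Consumer<String>> listeners = new ArrayList<>();
        final List<String> rows = new ArrayList<>();

        void addListener(Consumer<String> l) { listeners.add(l); }

        // Every insert is mirrored to the listeners immediately.
        void insert(String doc) {
            rows.add(doc);
            for (Consumer<String> l : listeners) l.accept(doc);
        }
    }
}
```

This is the pattern the hooks in MongoDB-style stores, DB triggers, and Hibernate listeners all boil down to.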
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
On Jul 8, 10:25 pm, Franz Allan Valencia See <franz....@gmail.com> wrote:
Thanks everyone.
Pardon Shay, let me elaborate.
Currently, my problems are:
a.) Getting the initial data from the main data source (MDS) into the replicated
data source (RDS) (which is Elasticsearch), and
b.) Keeping the data in the RDS updated
For a.) these are my choices so far:
a.1.) Query everything that I will replicate from the MDS and put it all
in the RDS
a.2.) Query the information from the MDS on demand and put that in the RDS
a.3.) Query the heavy parts from the MDS and put those in the RDS, then
query the rest of the data on demand from the MDS, combine it with the
data already in the RDS, and put the result back in the RDS (i.e., the
original query is select * from tbl_small, tbl_big where ..., so I first
replicate tbl_big, and then when the user does an action, I query
tbl_small, combine that with the replicated tbl_big, and put the result in
the query_aggregate RDS so that future queries only hit the RDS)
For b.) these are my choices so far:
b.1.) Have an event listener so that when data changes in the MDS, the
listener will update the data in the RDS (we're currently considering a
Hibernate listener or a DB trigger)
b.2.) Have a scheduled refresh of the records that changed since the last refresh
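Option b.2 could look roughly like this (a sketch under assumptions: Row, the fetch source, and the upsert call are hypothetical; a real job would query the MDS for rows with updated_at past the watermark and write them to ES):

```java
import java.util.ArrayList;
import java.util.List;

// Scheduled refresh by timestamp watermark: each run replicates only
// the rows that changed since the previous run.
public class ScheduledRefresher {

    static class Row {
        final String key;
        final long updatedAt;   // e.g. epoch millis from an updated_at column
        Row(String key, long updatedAt) { this.key = key; this.updatedAt = updatedAt; }
    }

    private long lastRefresh = 0L;
    final List<Row> upserted = new ArrayList<>();

    // One scheduled run: pick up everything modified since the last run.
    public int refresh(List<Row> source, long now) {
        int count = 0;
        for (Row r : source) {
            if (r.updatedAt > lastRefresh) {
                upserted.add(r);   // stand-in for an ES upsert
                count++;
            }
        }
        lastRefresh = now;
        return count;
    }
}
```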
But since I don't know much about Replication & Caching
Patterns/Anti-Patterns, I would like to ask how the community does it.
Thoughts?
--
Franz Allan Valencia See | Java Software Engineer
franz....@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see
On Fri, Jul 9, 2010 at 1:03 AM, Samuel Doyle <samueldo...@gmail.com> wrote:
At the moment I'm using polling, but since the main application is making
use of Hibernate, I may try to leverage a post-flush interceptor for
incremental updating.
The initial main indexing is done via straight SQL.
S.D.
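Samuel's post-flush idea amounts to batching dirty entities per unit of work and pushing them after the flush. A simplified stand-in (this is not Hibernate's actual API; the real version would extend org.hibernate.EmptyInterceptor and override postFlush):

```java
import java.util.ArrayList;
import java.util.List;

// Post-flush batching: collect entities modified during a session and
// send them to the index in one batch once Hibernate has flushed.
public class PostFlushSketch {

    private final List<String> dirty = new ArrayList<>();
    final List<String> indexed = new ArrayList<>();

    // Called whenever an entity is saved/updated in the session.
    public void onSave(String entity) { dirty.add(entity); }

    // Called after the flush: push the batch to ES and reset.
    public void postFlush() {
        indexed.addAll(dirty);   // stand-in for a bulk index request
        dirty.clear();
    }
}
```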
On Thu, Jul 8, 2010 at 5:40 AM, Franz Allan Valencia See <franz....@gmail.com> wrote:
Curious, if you guys are using Elasticsearch for replication, how do you
replicate your data?
I'm assuming your main data source is slow, so I was wondering how you
work around that to get faster replication.
Do you do your huge/slow query on your main data store and store that in
Elasticsearch, or do you guys query expensive tables and store those in
Elasticsearch, and then later recombine them? ...or?
Do you guys do lazy replication wherein the first time a query is made,
you retrieve data from your main data source, then replicate in
Elasticsearch? Or do you do a scheduled replication? ...or?
Thanks,
--
Franz Allan Valencia See | Java Software Engineer
franz....@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see