Replication Strategies

Curious: if you are using Elasticsearch as a replica of your main data source, how do you
replicate your data into it?

I'm assuming your main data source is slow, so I was wondering how you work
around that to get faster replication.

Do you run your huge/slow query against your main data store and store the result in
Elasticsearch, or do you query the expensive tables, store those in
Elasticsearch, and recombine them later? ...or? :)

Do you do lazy replication, where the first time a query is made you
retrieve the data from your main data source and then replicate it into Elasticsearch?
Or do you do scheduled replication? ...or? :)

Thanks,

--
Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

Replication means several things when it comes to Elasticsearch; here you mean
"replicating" changes made to the actual data source into Elasticsearch.
Usually the best approach is to build a system that applies changes made to
the data source to Elasticsearch as well. How you apply those changes
depends on the data source, and can be done in several ways:

  1. If the data source has hooks that let you be notified when something
    changes in it, a custom Elasticsearch hook can be written to apply
    those changes.

  2. If the data source provides a stream of the changes made to it, that stream
    can be used to apply the changes to Elasticsearch.

  3. A custom poller can be written that polls the data source for changes
    (for example, based on the last poll timestamp) and applies them to
    Elasticsearch; a minimal sketch of this follows below.
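
For what it's worth, here is a minimal sketch of option 3 in Java. It assumes a JDBC
source table with a last_modified column and an Elasticsearch node on localhost:9200;
the items table, its columns, and the items index/type are invented for the example.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

// Polls a JDBC data source for rows changed since the last poll and pushes
// each one into Elasticsearch over its REST API (PUT /index/type/id).
// The table, its columns, and the "items" index are hypothetical.
public class ChangePoller {

    private Timestamp lastPoll = new Timestamp(0L); // start from the epoch

    public void pollOnce(Connection conn) throws Exception {
        String sql = "SELECT id, title, body, last_modified FROM items WHERE last_modified > ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, lastPoll);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    indexDoc(rs.getLong("id"), rs.getString("title"), rs.getString("body"));
                    Timestamp ts = rs.getTimestamp("last_modified");
                    if (ts.after(lastPoll)) {
                        lastPoll = ts; // remember the newest change seen so far
                    }
                }
            }
        }
    }

    // Index one document per HTTP request; batching would be better for large volumes.
    private void indexDoc(long id, String title, String body) throws Exception {
        String json = "{\"title\":\"" + escape(title) + "\",\"body\":\"" + escape(body) + "\"}";
        HttpURLConnection http =
                (HttpURLConnection) new URL("http://localhost:9200/items/item/" + id).openConnection();
        http.setRequestMethod("PUT");
        http.setDoOutput(true);
        try (OutputStream out = http.getOutputStream()) {
            out.write(json.getBytes("UTF-8"));
        }
        http.getResponseCode(); // force the request to complete
        http.disconnect();
    }

    private String escape(String s) {
        return s == null ? "" : s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

You would run pollOnce on a schedule (a Timer, Quartz, cron, whatever). One caveat: a
pure timestamp check can miss rows committed in the same instant as the poll, so real
pollers usually overlap the window slightly or key off a monotonically increasing
id/version.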

-shay.banon


At the moment I'm using polling, but since the main application uses Hibernate
I may try to leverage a post-flush interceptor for incremental updates.
The initial full indexing is done via straight SQL.
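
For anyone curious what the post-flush route could look like, a rough sketch only: it
assumes a Hibernate EmptyInterceptor registered on the SessionFactory, and the
Indexable interface plus the app/doc index/type are invented for the example.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;
import org.hibernate.EmptyInterceptor;

// After Hibernate flushes a unit of work, forward the flushed entities to
// Elasticsearch. Entities that should be mirrored implement the (made-up)
// Indexable interface below.
public class EsSyncInterceptor extends EmptyInterceptor {

    public interface Indexable {
        String getEsId();
        String toJson();
    }

    @Override
    public void postFlush(Iterator entities) {
        while (entities.hasNext()) {
            Object entity = entities.next();
            if (entity instanceof Indexable) {
                Indexable doc = (Indexable) entity;
                try {
                    // Re-index after every flush; Elasticsearch overwrites by id,
                    // so flushing the same entity twice is harmless.
                    put("http://localhost:9200/app/doc/" + doc.getEsId(), doc.toJson());
                } catch (Exception e) {
                    // Don't fail the transaction just because search indexing failed.
                    e.printStackTrace();
                }
            }
        }
    }

    private void put(String url, String json) throws Exception {
        HttpURLConnection http = (HttpURLConnection) new URL(url).openConnection();
        http.setRequestMethod("PUT");
        http.setDoOutput(true);
        try (OutputStream out = http.getOutputStream()) {
            out.write(json.getBytes("UTF-8"));
        }
        http.getResponseCode();
        http.disconnect();
    }
}
```

The interceptor has to be registered before the SessionFactory is built (e.g.
Configuration.setInterceptor(new EsSyncInterceptor())). The trade-off versus polling is
that it only catches changes that go through Hibernate, so anything written via straight
SQL still needs the bulk/poll path.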

S.D.


Thanks everyone.

Pardon Shay, let me elaborate :)

Currently, my problems are:
a.) Getting the initial data from the main data source (MDS) into the replicated
data source (RDS), which is Elasticsearch, and
b.) Keeping the data in the RDS up to date

For a.) these are my choices so far:
a.1.) Query everything I want to replicate from the MDS and put it all in the RDS
a.2.) Query the information from the MDS on demand and put it in the RDS (sketched
just after this list)
a.3.) Query the heavy parts from the MDS and put those in the RDS, then query the
rest of the data on demand from the MDS, combine it with the data already in the
RDS, and put the result back in the RDS. (For example, if the original query is
select * from tbl_small, tbl_big where ..., I first replicate tbl_big; then, when
the user does an action, I query tbl_small, combine that with the replicated
tbl_big, and put the result into a query_aggregate index in the RDS so that future
queries only hit the RDS.)
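
Of those, a.2 is the lazy replication idea from my original post. A rough sketch of it,
assuming an Elasticsearch node on localhost:9200; the docs/doc index/type and the
MdsLoader callback (standing in for the expensive MDS query) are invented.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Read-through lookup: try the RDS (Elasticsearch) first; on a miss, run the
// expensive query against the MDS, index the result, and return it.
public class ReadThroughStore {

    public interface MdsLoader {
        String loadJson(String id) throws Exception; // the expensive MDS query
    }

    private final String baseUrl = "http://localhost:9200/docs/doc/";

    public String get(String id, MdsLoader mds) throws Exception {
        HttpURLConnection http = (HttpURLConnection) new URL(baseUrl + id).openConnection();
        if (http.getResponseCode() == 200) {
            // Hit: this is the GET envelope; the document itself is under "_source".
            return readBody(http.getInputStream());
        }
        // Miss: fall back to the MDS, then populate the RDS for next time.
        String json = mds.loadJson(id);
        HttpURLConnection put = (HttpURLConnection) new URL(baseUrl + id).openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        try (OutputStream out = put.getOutputStream()) {
            out.write(json.getBytes("UTF-8"));
        }
        put.getResponseCode();
        return json;
    }

    private String readBody(InputStream in) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toString("UTF-8");
    }
}
```

The usual caveat with read-through: the first request for each document still pays the
full MDS cost, and something under b.) below is still needed to keep already-replicated
documents fresh.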

For b.) these are my choices so far:
b.1.) Have an event listener so that when data in the MDS changes, the listener
updates the data in the RDS (we're currently considering a Hibernate listener or a
DB trigger)
b.2.) Have a scheduled refresh of whatever changed since the last refresh

But since I don't know much about replication and caching patterns/anti-patterns,
I'd like to ask how the community does it :)

Thoughts?

--
Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see


Franz, I think you really just need to do something similar to what
DIH (DataImportHandler) in Solr does, which is roughly this:

  • Have a tool that can do a bulk import/indexing
  • Have a tool (perhaps it's the same tool) that can do incremental
    import/indexing. It can do that by keeping track of the last imported
    ID or by tracking the timestamp of the last import or some such.

The bulk import is something you'd do once, at the beginning, when
your ES index is empty. Then you would periodically run the tool to
do the incremental indexing. You would also use the bulk importer/indexer
if you have to reindex from scratch for whatever reason.

The above will not get your data into ES in real time, so if you need
real-time search, you need a real-time approach instead of the
incremental one. In that case you need a mechanism that indexes
individual records/documents into ES as soon as they are added
to your data store. Some data stores have hooks that make this kind of
thing automatic (MongoDB, Terrastore, etc.). Hibernate Search does this
for relational databases and Lucene (the low-level Java library ES uses
for indexing/searching).
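
A sketch of the incremental half of such a tool, meant to be run periodically and
persisting the last imported id to a checkpoint file between runs; the JDBC URL, the
articles table/columns, and the articles index/type are placeholders, and it reuses the
same one-PUT-per-document approach as the poller sketch earlier in the thread.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Incremental importer run from cron or similar. It keeps the last imported id
// in a checkpoint file so each run only picks up new rows.
public class IncrementalImporter {

    private static final Path CHECKPOINT = Paths.get("es-import.checkpoint");

    public static void main(String[] args) throws Exception {
        long lastId = readCheckpoint();
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/app", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, title, body FROM articles WHERE id > ? ORDER BY id")) {
            ps.setLong(1, lastId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    index(id, rs.getString("title"), rs.getString("body"));
                    lastId = id;
                }
            }
        }
        Files.write(CHECKPOINT, Long.toString(lastId).getBytes(StandardCharsets.UTF_8));
    }

    private static long readCheckpoint() throws Exception {
        if (!Files.exists(CHECKPOINT)) {
            return 0L; // empty index: the first run doubles as the bulk import
        }
        return Long.parseLong(new String(Files.readAllBytes(CHECKPOINT), StandardCharsets.UTF_8).trim());
    }

    private static void index(long id, String title, String body) throws Exception {
        String json = "{\"title\":\"" + esc(title) + "\",\"body\":\"" + esc(body) + "\"}";
        HttpURLConnection http =
                (HttpURLConnection) new URL("http://localhost:9200/articles/article/" + id).openConnection();
        http.setRequestMethod("PUT");
        http.setDoOutput(true);
        try (OutputStream out = http.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        http.getResponseCode();
        http.disconnect();
    }

    private static String esc(String s) {
        return s == null ? "" : s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Tracking an auto-increment id only catches inserts; to pick up updates you'd switch the
checkpoint to a last-modified timestamp, as mentioned above, and for any real volume
you'd batch documents rather than doing one PUT per row.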

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


I guess I'm on the right path then.

Thanks,

Franz Allan Valencia See | Java Software Engineer
franz.see@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see
