Multi Data Center Deployment


(talsalmona) #1

Hi community,

I've recently started looking into deploying ElasticSearch in two data
centers.
My initial thought is to build an ES cluster in each of the data
centers and, in addition, to have couchdb instances in each of the data
centers.
When indexing, I will push the JSON into both couch and ES, let
couch do its sync magic between the data centers, and then index the
JSON again on the other end.
I can tolerate some delay in the sync between the data centers.

The one thing I don't really like about this solution is that I need
to build a layer on top of ES and couch and that I need to duplicate
data.

Any thoughts?

Thanks,
Tal


(Shay Banon) #2

Hi,

Sounds like a good solution. That layer on top of couchdb is something
that I am working on as a "side" project. It's part of a larger effort I call
"indexer". I will post something on it once I have more concrete things to
show.

Multi data center is always an interesting problem for distributed
systems. It's not easy to solve, and any solution comes with its
drawbacks. Even so-called nosql solutions that can span multiple data
centers easily sacrifice things to achieve that (and no, I am not talking
about consistency here, which is eventual in those solutions, but about the
fact that you can actually lose data).

I do plan to attack this problem in elasticsearch. For now, the best
solution is doing something similar to what you suggest (couchdb) or using a
messaging layer between the two clusters.

-shay.banon

On Mon, Sep 13, 2010 at 9:21 PM, Tal talsalmona@gmail.com wrote:


(Mahendra M) #3

Hi Shay,

On Tue, Sep 14, 2010 at 4:01 AM, Shay Banon wrote:

Sounds like a good solution. That layer on top of couchdb is something
that I am working on as a "side" project. It's part of a larger effort I call
"indexer". I will post something on it once I have more concrete things to
show.

I was also working on something similar: a listener on the couchdb
_changes feed, which keeps syncing with ElasticSearch. I was also working
on an "async" version using twisted.
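
As a rough illustration of what such a listener has to do, here is a minimal sketch of the translation step only. It assumes the _changes feed was requested with include_docs=true and already parsed from JSON, and that the result is sent to the Elasticsearch bulk endpoint by some HTTP client that is not shown. The function name changes_to_bulk and the index name are hypothetical, and the action metadata follows the current bulk format (older Elasticsearch versions also expect a _type field).

```python
import json


def changes_to_bulk(changes, index_name):
    """Turn a parsed CouchDB _changes response into a bulk request body.

    Returns (bulk_body, last_seq). bulk_body is newline-delimited JSON;
    last_seq is the value to pass as `since` on the next request.
    """
    lines = []
    for row in changes.get("results", []):
        doc_id = row["id"]
        if row.get("deleted"):
            # A deletion in CouchDB becomes a delete action in the index.
            lines.append(json.dumps({"delete": {"_index": index_name, "_id": doc_id}}))
            continue
        doc = row.get("doc")
        if doc is None:
            # The feed was read without include_docs=true; nothing to index.
            continue
        # Drop CouchDB bookkeeping fields such as _id and _rev from the source.
        source = {k: v for k, v in doc.items() if not k.startswith("_")}
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    body = ("\n".join(lines) + "\n") if lines else ""
    return body, changes.get("last_seq")
```

Remembering the returned last_seq between runs is what lets the listener pick up where it left off instead of replaying the whole feed.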

Let me know if I can help out with your project. Maybe I can stop
working on my stuff and focus on yours instead...

Regards,
Mahendra

http://twitter.com/mahendra


(Shay Banon) #4

I am currently building the infrastructure within elasticsearch to support
indexers (of any type) with hot failover and async state storage. Once it's
done, one such indexer can be couchdb (running from "within" elasticsearch).

-shay.banon

On Tue, Sep 14, 2010 at 7:17 AM, Mahendra M mahendra.m@gmail.com wrote:


(Benoit Chesneau) #5

On Sep 14, 3:31 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

I am currently building the infrastructure within elasticsearch to support
indexers (of any type) with hot failover and async state storage. Once it's
done, one such indexer can be couchdb (running from "within" elasticsearch).

-shay.banon

Can you say more about it? I'm working on something similar too.

- benoit

(Shay Banon) #6

It's hard to say much, since things are forming as I speak. In general, the
idea is that there will be indexers (similar in concept to indices, as
entities in the cluster, but they do not hold data or shards). Those indexers
are allocated to nodes and run there, and their job is to index new data into
elasticsearch. elasticsearch will provide failover support for those
indexers (i.e. if a node fails, they will be started on another node), and
simple state storage (e.g. having indexed document 1000, an indexer that gets
restarted needs to resume indexing from there).
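
The failover and state storage idea above can be illustrated with a small in-memory sketch. Everything here is hypothetical and only models the behaviour described: StateStore stands in for whatever state storage the cluster would provide, run_indexer stands in for an indexer started on some node, and fail_at simulates the node dying partway through.

```python
class StateStore:
    """Hypothetical stand-in for cluster-side indexer state storage."""

    def __init__(self):
        self._checkpoints = {}

    def load(self, indexer_name):
        # An indexer that has never run starts from position 0.
        return self._checkpoints.get(indexer_name, 0)

    def save(self, indexer_name, position):
        self._checkpoints[indexer_name] = position


def run_indexer(name, source, index, store, fail_at=None):
    """Index documents from `source`, resuming at the last saved position.

    `source` is a list of documents with an "id" key and `index` is a dict
    acting as the search index. `fail_at` simulates a node failure just
    before the document at that position is processed.
    """
    position = store.load(name)
    while position < len(source):
        if fail_at is not None and position == fail_at:
            raise RuntimeError("simulated node failure")
        doc = source[position]
        index[doc["id"]] = doc          # keyed by id, so a replay overwrites
        position += 1
        store.save(name, position)      # checkpoint after each document
    return position
```

Because the checkpoint is saved after the write, a crash between the two steps replays at most one document on restart, and writing by document id makes that replay harmless.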

An indexer will be open, with different types of indexer implementations. One
of them can be something that reads the stream of couchdb changes and indexes
them. Others can listen to a rabbitmq queue and index data. Really, the sky
is the limit with the different indexers that can be implemented. I also hope
to make it easy to write them in other languages as well (like (j)ruby,
groovy, ...).
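
To make the "different types of indexer" idea concrete, here is a hedged sketch of what a pluggable interface could look like. None of these names come from elasticsearch; Indexer, ChangesFeedIndexer, QueueIndexer and run_once are all invented for illustration, and the queue-based one uses Python's in-process queue where a real implementation would consume from rabbitmq.

```python
import queue
from abc import ABC, abstractmethod


class Indexer(ABC):
    """Hypothetical plug-in interface: each indexer pulls from one source."""

    @abstractmethod
    def poll(self):
        """Return a list of (doc_id, document) pairs that are ready to index."""


class ChangesFeedIndexer(Indexer):
    """Consumes rows shaped like CouchDB _changes results with docs included."""

    def __init__(self, rows):
        self._rows = list(rows)

    def poll(self):
        ready = [(row["id"], row["doc"]) for row in self._rows if "doc" in row]
        self._rows.clear()
        return ready


class QueueIndexer(Indexer):
    """Drains an in-process queue; a message-broker consumer would sit here."""

    def __init__(self, q):
        self._q = q

    def poll(self):
        ready = []
        while True:
            try:
                ready.append(self._q.get_nowait())
            except queue.Empty:
                return ready


def run_once(indexers, index):
    """Poll every indexer and write what it returns into `index` (a dict)."""
    for indexer in indexers:
        for doc_id, doc in indexer.poll():
            index[doc_id] = doc
    return index
```

The point of the shared poll() method is that the code which schedules indexers and writes to the index does not need to know where the documents come from.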

-shay.banon

On Wed, Sep 15, 2010 at 2:53 PM, Benoit Chesneau bchesneau@gmail.com wrote:


(Benoit Chesneau) #7

On Sep 15, 7:44 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

It's hard to say much, since things are forming as I speak. In general, the
idea is that there will be indexers (similar in concept to indices, as
entities in the cluster, but they do not hold data or shards). Those indexers
are allocated to nodes and run there, and their job is to index new data into
elasticsearch. elasticsearch will provide failover support for those
indexers (i.e. if a node fails, they will be started on another node), and
simple state storage (e.g. having indexed document 1000, an indexer that gets
restarted needs to resume indexing from there).

An indexer will be open, with different types of indexer implementations. One
of them can be something that reads the stream of couchdb changes and indexes
them. Others can listen to a rabbitmq queue and index data. Really, the sky
is the limit with the different indexers that can be implemented. I also hope
to make it easy to write them in other languages as well (like (j)ruby,
groovy, ...).

-shay.banon

My idea is simpler here.

I want to plug the indexer into CouchDB directly, using the
CouchDB internal system to detect when a doc needs to be indexed or
reindexed, in the same way we do for views. Using BigCouch, this indexer
will run on each CouchDB node and put its output into elasticsearch rather
than saving it to the fs. A custom handler will allow querying
elasticsearch or anything else. Basically, CouchDB is the provider and
API point.

- benoit

(Shay Banon) #8

Sounds good. That's another option for an integration point, which can be
better (especially for systems that don't provide a "changes" API). Ping us if
you make it open source, and I will post it on the elasticsearch website.

-shay.banon

On Mon, Sep 20, 2010 at 7:47 AM, Benoit Chesneau bchesneau@gmail.com wrote:

