Server mapping between elasticsearch and mongodb

We would like to build Elasticsearch on top of our MongoDB, using the
river plugin.

- What would be the best server mapping between Elasticsearch and
MongoDB?
1 Elasticsearch <-> 1 Mongo replication server?

thanks,

--

Hello Robbie,

It all depends on a lot of factors, besides the raw volume of data. For example:

  • what your data looks like
  • what hardware you run it on
  • what your queries look like
  • how often you query it
  • how often you insert/update/delete data

I'd start with one ES server in a test environment and see how it
goes. Monitor its performance and tune your settings if necessary.
You can always post here if you run into trouble. Only after that
would I add nodes :)

There are quite a few tools out there for monitoring ES, and obviously
I'd prefer ours :D
http://sematext.com/spm/elasticsearch-performance-monitoring/index.html

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

(In reply to Robbie Cheng, Tue, Oct 23, 2012 at 1:54 PM)

--

Hi Radu,

Thanks for your detailed response.

Currently, we're running a Mongo cluster that contains 1 router, 3
config servers, and 3 shard servers.
According to the river's documentation, it only works with Mongo
replica sets. Does that imply it has to sit on top of each
replica set?
So the infrastructure might look like this:

router -> config
       -> replica 1 -> (river) es1
       -> replica 2 -> (river) es2

- Is there an easy way to determine the hardware spec we need to meet
our requirements?
- What would be the ES endpoint for the application, based on the above
assumption?

Thanks in advance,

(In reply to Radu Gheorghe, Oct 23, 8:23 pm)

--

Hello Robbie,

On Tue, Oct 23, 2012 at 7:57 PM, Robbie Cheng wrote:

> Hi Radu,
>
> Thanks for your detailed response.

You're welcome :)

> Currently, we're running a Mongo cluster that contains 1 router, 3
> config servers, and 3 shard servers.
> According to the river's documentation, it only works with Mongo
> replica sets. Does that imply it has to sit on top of each
> replica set?

I haven't worked with the mongodb river (yet :p) but as far as I
understand, it works with the operations log. If you have 3 shards and
one replica set for each shard, then I suppose you need to look at all
3 oplogs in order to fetch all the data.

So yes, you should need 3 rivers, but I'm not 100% sure. Maybe someone
else could join in and confirm/deny.

> So the infrastructure might look like this:
>
> router -> config
>        -> replica 1 -> (river) es1
>        -> replica 2 -> (river) es2
>
> - Is there an easy way to determine the hardware spec we need to meet
> our requirements?

Not that I'm aware of. The best you could do is take a test machine
and run some tests with a subset (say, 10%) of your data. Then you can
check the performance and estimate (roughly) what you would need for
the full data set, at 10x the size of the sample.

Since you have this sharding, you could try with one river and one
replica set on one ES node and see how it goes.
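
The 10% test described above comes down to simple arithmetic. Here is a rough sketch in Python; the measurements are made-up placeholders, and linear scaling is only an assumption, since index size and memory use rarely grow perfectly linearly with the data:

```python
def extrapolate(measured, sample_fraction):
    """Scale figures measured on a data sample up to the full data set.

    Assumes linear growth, which is only a first approximation.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    factor = 1 / sample_fraction
    return {name: value * factor for name, value in measured.items()}


# Hypothetical numbers from a test node loaded with 10% of the data.
sample = {"index_size_gb": 12.0, "heap_used_gb": 1.5}
estimate = extrapolate(sample, sample_fraction=0.10)
print(estimate)
```

Treat the result as a starting point for the next test run rather than a final hardware spec.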

> - What would be the ES endpoint for the application, based on the above
> assumption?

You mean, when you have multiple ES nodes? By default, each node can
be your endpoint. Every node automatically "routes" the request to
shards hosted on other nodes if necessary. You can change these
settings - more info here:
http://www.elasticsearch.org/guide/reference/modules/node.html

How you connect to that API depends on how you build your application.
You can use the Java API:
http://www.elasticsearch.org/guide/reference/java-api/

Or you can use the REST API over HTTP, Thrift, memcached or whatever
transport plugin you can find or build:
http://www.elasticsearch.org/guide/reference/api/
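
Since any node can serve as the endpoint, an application using the REST API over HTTP can simply rotate between nodes. A minimal sketch in Python, using only the standard library; the node addresses `es1`/`es2` and the index name `myindex` are hypothetical, and the request is built but not sent:

```python
import itertools
import json
import urllib.request

# Hypothetical node addresses; 9200 is the default HTTP port.
NODES = ["http://es1:9200", "http://es2:9200"]
_next_node = itertools.cycle(NODES)


def build_search_request(index, query):
    """Build (but do not send) a search request for the next node in turn."""
    base = next(_next_node)
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("myindex", {"match_all": {}})
print(req.get_method(), req.full_url)
# To actually run it against a live node:
#   result = json.load(urllib.request.urlopen(req))
```

Whichever node receives the request forwards it to the shards it needs, so the client does not have to know where the data lives.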

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

--

Hi,

Support for sharded collections has not been implemented in the river yet.
I hope to look at this feature very soon.

Thanks,
Richard.

(In reply to Radu Gheorghe, Wednesday, October 24, 2012 6:03:08 AM UTC-4)

--

Hi Richard,

Can you explain this part a bit? I thought the river just pulled data from
MongoDB's oplog. Does it make any difference whether we use sharding or not?

Thanks,

(In reply to Richard Louapre, Monday, November 5, 2012, 4:38:07 PM UTC+8)

--

Hi Robbie,

As Radu said, in a sharded environment the logic is different: the river
should look at the oplog of each shard and collect the data from there.

Basically, the river will need to connect to a mongos instance, analyze the
sharding settings, then connect to all shards and monitor their oplogs.
That is not supported yet.

The current implementation only connects to a single mongod instance (in
replica set mode).
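
To illustrate the "analyze the sharding settings" step, here is a small Python sketch. It is not the river's code, only an illustration: it assumes the shard list has already been read through mongos from the `config.shards` collection, where each entry's `host` value has the form `replicaSetName/host1:port,host2:port`. The shard ids and host names below are hypothetical.

```python
def parse_shard_host(host):
    """Split a shard 'host' value into (replica_set_name, [members]).

    'rs0/a:27018,b:27018' -> ('rs0', ['a:27018', 'b:27018'])
    A shard that is a standalone mongod has no 'name/' prefix.
    """
    name, sep, members = host.partition("/")
    if not sep:
        return None, [host]
    return name, members.split(",")


def oplog_targets(shard_docs):
    """Map each shard id to the members whose oplog would need monitoring."""
    return {doc["_id"]: parse_shard_host(doc["host"])[1] for doc in shard_docs}


# Hypothetical documents, shaped like entries in config.shards.
shards = [
    {"_id": "shard0000", "host": "rs0/mongo1:27018,mongo2:27018"},
    {"_id": "shard0001", "host": "rs1/mongo3:27018"},
]
for shard_id, members in oplog_targets(shards).items():
    print(shard_id, "->", members)
```

Each entry in the result is one replica set whose oplog would have to be followed separately, which is why a sharded cluster needs more than a single connection.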

I am currently testing a sharded environment and hope to make it available
soon.

Thanks,
Richard.

(In reply to Robbie Cheng, Wednesday, November 21, 2012 8:32:15 PM UTC-5)

--

Hi Richard,

Thanks for the clarification. So, in a sharded MongoDB environment, the mapping between shard servers and rivers is 1-to-1.

But is it possible for one river to map to all shards and pull their data all together? Any side effects?

By the way, the river will pull the existing data from the shard servers, right?

Lastly, any update on hooking the river up to the router directly?

Thanks,
Robbie

--