Optimal setup for http-only vs data nodes

Hi,

I am wondering what the optimal mix of http-only and data nodes would
be for my setup. My application continuously streams data into ES for
indexing at a rate of 50-200 documents per second. Currently I am using
a single http-only node (plus 4 data-only nodes) for both indexing and
searching, both through the REST interface. My search requests are
usually very complex and span multiple indices, but they are not
executed at a very high frequency, typically no more than a few per
second.

Would having 2 http-only nodes, one for indexing documents and the
other for handling search requests be a good idea
performance/scalability wise?

Thanks,
Colin
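
To make the proposed split concrete, here is a minimal client-side sketch of routing search requests to one http-only node and everything else to another. The two host names are hypothetical, the port is the default HTTP port 9200, and the rule "a path ending in `_search` is a search" is a simplifying assumption, not something stated in the post.

```python
from urllib.parse import urljoin

# Hypothetical http-only nodes; replace with real host names.
INDEX_NODE = "http://es-index.example:9200/"
SEARCH_NODE = "http://es-search.example:9200/"


def endpoint_for(path):
    """Return the full URL for a REST path, choosing the node by request type.

    Paths whose last segment is `_search` go to the search node;
    all other paths (document indexing, etc.) go to the index node.
    """
    # Ignore any query string when deciding what kind of request this is.
    bare_path = path.split("?", 1)[0].rstrip("/")
    base = SEARCH_NODE if bare_path.endswith("_search") else INDEX_NODE
    return urljoin(base, path.lstrip("/"))
```

Whether this separation actually helps depends on where the bottleneck is, which is what the replies go on to discuss.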

The best way to answer this is to run some load tests using your
incoming documents and your typical index queries.

We run four nodes all in a data configuration. We use the Java API for
indexing and searching. Our typical index load is 5-35 index requests
per second, which produces almost no load. Our searches span multiple
indexes and have 10-25 criteria. Our search load is 100-150 queries/sec
against 31M docs, which we handle in the 34-40ms range.

David
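
As a rough illustration of the load test suggested above, the sketch below paces requests at a target rate and records per-request latency. The `send` callable is a placeholder assumption: in a real test it would issue an index or search request against the cluster, with your own documents and queries as payloads. The percentile calculation is a simple nearest-rank approximation.

```python
import time
from statistics import mean


def run_load(send, payloads, target_rate):
    """Call send(payload) at roughly target_rate requests per second.

    Returns the list of per-request latencies in milliseconds.
    """
    interval = 1.0 / target_rate
    latencies = []
    for payload in payloads:
        start = time.perf_counter()
        send(payload)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed * 1000.0)
        # Sleep off the rest of the interval to hold the target rate.
        remaining = interval - elapsed
        if remaining > 0:
            time.sleep(remaining)
    return latencies


def summarize(latencies):
    """Summarize a non-empty list of latencies (ms)."""
    ordered = sorted(latencies)
    # Nearest-rank 95th percentile.
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "count": len(ordered),
        "mean_ms": mean(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }
```

Running the same payloads once through the http-only node and once directly against the data nodes, then comparing the two summaries, would give a direct answer for a specific workload.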

Thanks for your input.

Actually, what I am really wondering about is the benefit of doing the
indexing through a non-data node. The documentation
http://www.elasticsearch.com/docs/elasticsearch/modules/node/data_node/
explains the benefit of running queries on a non-data node well, but
for indexing, is there a significant gain in running indexing requests
on a non-data node?

Thanks,
Colin

In general, it's good practice to make all your nodes data nodes. If you have a machine that is just hosting an HTTP ES proxy node (non-data), then it's probably wasteful, and it's better to make it a data node to share the load.
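
One way a client might spread requests when every node is a data node is simple round-robin over the nodes' HTTP endpoints. This is only a sketch under assumptions: the host names are hypothetical, every node is assumed to have HTTP enabled on the default port 9200, and there is no failure handling.

```python
from itertools import cycle


class RoundRobinHosts:
    """Hand out request URLs, rotating through a fixed list of hosts."""

    def __init__(self, hosts, port=9200):
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = cycle(list(hosts))
        self._port = port

    def next_url(self, path):
        # Each call advances to the next host in the rotation.
        host = next(self._hosts)
        return f"http://{host}:{self._port}/{path.lstrip('/')}"
```

For example, `RoundRobinHosts(["es1", "es2", "es3", "es4"])` would send consecutive requests to each of four nodes in turn, so no single machine handles all of the HTTP traffic.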

You may want to do this for security though, no? Since ES does not have
authentication, SSL, etc., it may make sense to have a front end that is a
non-data node, and keep the ES data nodes behind a firewall, only accessible
via the front end.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

Shay,

So, in that respect, are the benefits of using an http-only node as
described here http://www.elasticsearch.com/docs/elasticsearch/modules/node/data_node/
still valid? In particular: <<This relieves the data nodes to do the heavy
duty of indexing and searching, without needing to process HTTP
requests (parsing), overload the network, or perform the gather
processing>>

Thanks,
Colin

It is still valid, but if you end up with this load balancer on a single machine that does just that, it's better to make it another data node. The way I was envisioning it, you have a load balancer machine that hosts other load balancers such as Apache or nginx, and you can run an ES load balancer node there as well.