Optimal setup for http only vs data nodes

colinsurprenant · January 31, 2011, 4:22pm

Hi,

I am wondering what would be an optimal setup in terms of http-only
versus data nodes for my cluster. My application continuously
streams data into ES for indexing at a rate of 50-200 documents per
second. Currently I am using a single http-only node (plus 4
data-only nodes) for both indexing and searching, both through the
REST interface. My search requests are usually very complex and
span multiple indices, but are not executed at a very high
frequency, typically no more than a few per second.

Would having 2 http-only nodes, one for indexing documents and the
other for handling search requests be a good idea
performance/scalability wise?

Thanks,
Colin

dbenson · January 31, 2011, 5:49pm

The best way to answer this is to run some load tests using your
incoming documents and your typical index queries.

We run four nodes, all in a data configuration. We use the Java API for
indexing and searching. Our typical index load is 5-35 index requests
per second, which puts almost no load on the cluster. Our searches span
multiple indexes and have 10-25 criteria. Our search load is 100-150
queries/sec against 31M docs, which we handle in the 34-40ms range.
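
A load test of this kind can be sketched in a few lines of Python. This is only a minimal illustration, not a benchmark tool: `measure_latencies`, `summarize` and `http_search` are hypothetical helper names, and the URL and query passed to `http_search` are assumptions you would replace with your own endpoint and your own typical queries.

```python
import json
import math
import statistics
import time
import urllib.request


def measure_latencies(send_request, count):
    """Call send_request() `count` times and return each latency in ms."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        send_request()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Return count, mean, nearest-rank 95th percentile and max."""
    ordered = sorted(latencies_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "count": len(ordered),
        "mean_ms": statistics.mean(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


def http_search(url, query):
    """POST a JSON query to a search URL and return the raw response body.

    The URL is an assumption, e.g. "http://localhost:9200/_search".
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


# Example usage (needs a reachable node, so it is left commented out):
# run = lambda: http_search("http://localhost:9200/_search",
#                           {"query": {"match_all": {}}})
# print(summarize(measure_latencies(run, 100)))
```

Running the same measurement against each candidate layout (http-only front node versus all data nodes) gives numbers you can compare directly.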

David

colinsurprenant · January 31, 2011, 7:35pm

Thanks for your input.

Actually, what I am really wondering about is the benefit of doing the
indexing on a non-data node. The documentation at
http://www.elasticsearch.com/docs/elasticsearch/modules/node/data_node/
explains the benefit of running queries on a non-data node well, but
for indexing, is there a significant gain in running indexing requests
on a non-data node?

Thanks,
Colin

kimchy · January 31, 2011, 8:11pm

In general, it's good practice to make all your nodes data nodes. If you have a machine that is just hosting an HTTP ES proxy node (non-data), then it's probably wasteful, and it's better to make it a data node to share the load.
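
When every node is a data node with HTTP enabled, the client can spread its REST requests across all of them. The following is a minimal sketch of that idea, assuming four nodes; the host names are hypothetical and `round_robin` is an illustrative helper, not part of any ES client. Port 9200 is the default HTTP port.

```python
import itertools


def round_robin(hosts, port=9200):
    """Yield base URLs for the given hosts in an endless rotation."""
    for host in itertools.cycle(hosts):
        yield f"http://{host}:{port}"


# Hypothetical host names for a four-node cluster.
DATA_NODES = ["es-data-1", "es-data-2", "es-data-3", "es-data-4"]

endpoints = round_robin(DATA_NODES)

# Each request takes the next endpoint; the fifth wraps back to the first.
first_five = [next(endpoints) for _ in range(5)]
```

A real client would also need to skip nodes that are down, which this sketch does not attempt.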

Berkay_Mollamustafao · January 31, 2011, 8:59pm

You may want to do this for security though, no? Since ES does not have
authentication, SSL, etc., it may make sense to have a front end that is a
non-data node, and keep the ES data nodes behind a firewall, accessible only
via the front end.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

colinsurprenant · February 2, 2011, 7:39pm

Shay,

So, in that respect, are the benefits of using an http-only node, as
described at http://www.elasticsearch.com/docs/elasticsearch/modules/node/data_node/,
still valid? In particular: "This relieves the data nodes to do the heavy
duty of indexing and searching, without needing to process HTTP
requests (parsing), overload the network, or perform the gather
processing"

Thanks,
Colin

kimchy · February 2, 2011, 8:41pm

It is still valid, but if you end up with this load balancer on a single machine that does just that, it's better to make it another data node. The way I was envisioning it, you have a load balancer machine that already hosts other load balancers such as Apache / nginx, and you can run an ES load balancer node there as well.
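
For reference, the two node roles in that layout differ only in a couple of settings. The sketch below renders them as `elasticsearch.yml`-style lines. It assumes the `node.data` and `http.enabled` setting names used by releases from that period, so check them against the documentation for your version; `render_settings` is a hypothetical helper used only for illustration.

```python
# Setting names are assumed from 2011-era releases; verify for your version.
# Load balancer node: holds no data, accepts HTTP requests.
LOAD_BALANCER_NODE = {"node.data": False, "http.enabled": True}

# Data node: holds data, does not accept HTTP requests.
DATA_NODE = {"node.data": True, "http.enabled": False}


def render_settings(settings):
    """Render a flat settings dict as elasticsearch.yml-style lines."""
    lines = []
    for key in sorted(settings):
        value = settings[key]
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


load_balancer_yml = render_settings(LOAD_BALANCER_NODE)
data_node_yml = render_settings(DATA_NODE)
```

If the load balancer machine does nothing else, setting `node.data` to true there as well gives the all-data-nodes layout recommended above.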