Scaling Elasticsearch? (are non-data nodes the norm?)

Please excuse the naivety here. I'm trying to translate from a Solr
background, where we had a ton of read-only query slaves and a couple of
masters to handle the write traffic.

Here is our situation:

We have a set of REST-based web services that act as a thin translation
layer on top of Elasticsearch. Those web services are hosted in
Dropwizard (Jetty). From those JVMs, we connect to Elasticsearch using
the Elasticsearch Java API. We are trying to figure out the best
deployment topology to scale out the system.

We've seen mention of using non-data nodes to handle the incoming queries.
Those nodes front the actual data nodes, handling the HTTP traffic. We've
also seen mention of making those the master nodes. What is the best
practice here? Something like this...?

HTTP -> Jetty -> HTTP Load Balancer -> Non-Data Nodes (Masters) -> Data
Nodes

We were wondering: if we just create non-data embedded nodes within the
JVMs hosting the REST services, does this achieve the same goal of
offloading the HTTP connections from the data nodes? Effectively, the JVM
hosting the REST-based services becomes the buffer between the HTTP
connections and the data nodes. In this scenario we would have...

HTTP -> Jetty w/ Embedded ES node -> Data Nodes

This seems reasonable, but we weren't sure of the impact as we scale the
number of REST hosts (and, consequently, the number of ES nodes).

Anyone have any links or advice you can share?

thanks -- still a noob at scaling ES,

-brian


HTTP -> Jetty w/ Embedded ES node -> Data Nodes

Yep, that seems reasonable to me. It isn't uncommon to have dozens of
client nodes (a.k.a. non-data, non-master nodes). People will very often add
a local client node to each HTTP box, and since you are running a JVM
anyway, embedding is basically the same.
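
To make that concrete, here is a minimal sketch (against the 0.90-era
NodeBuilder API; the cluster name is a placeholder) of embedding a client
node in the Dropwizard/Jetty JVM:

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class EmbeddedClientNode {
    public static void main(String[] args) {
        // client(true) marks the node as a pure client: it joins the cluster
        // and routes requests, but holds no shards itself.
        Node node = NodeBuilder.nodeBuilder()
                .clusterName("my-cluster")   // placeholder cluster name
                .client(true)
                .node();

        Client client = node.client();
        // Use 'client' exactly like any other Java API client, e.g.
        // client.prepareSearch("my-index").execute().actionGet();

        node.close();   // shut the embedded node down when Jetty stops
    }
}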

The only thing to be aware of is that client nodes will need some amount of
memory for reducing result sets. Imagine your query asks for 10,000 results
and your index has 10 shards. The client node needs enough memory to build a
priority queue of 10 shards x 10,000 hits = 100,000 entries to find the top
10,000 results. In general, you shouldn't be asking for result sets that
large anyway, but it's something to keep in mind if your REST machines are
lightweight.
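
As a rough illustration of that cost (placeholder index name, assuming a
10-shard index), a request like the one below forces the client node to
buffer all of those candidate hits before returning the top 10,000:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

public class DeepResultSetExample {
    // The coordinating (client) node must merge size-many hits from every
    // shard, so its heap usage scales with shards * size.
    static SearchResponse topTenThousand(Client client) {
        return client.prepareSearch("my-index")   // placeholder index name
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(10000)
                .execute()
                .actionGet();
    }
}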

There is a certain amount of overhead to having more nodes in the cluster,
since any change to the cluster state has to be published to all nodes...but
in general this is fairly light.

I would avoid making the REST machines master-eligible, however. Keep them
as non-data/non-master. Some people do it, but I feel that defeats the
purpose of a dedicated master node (enhanced stability due to limited
responsibility and load). I'd either make the data nodes master-eligible,
or add a few lightweight master-only nodes that handle the master
responsibilities.
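
For reference, a sketch of how the three roles could be expressed through
node settings (the node.master / node.data keys are the standard ones; the
ImmutableSettings builder is the 0.90-era API, and the same keys can just as
well live in elasticsearch.yml):

import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;

public class NodeRoles {
    // Dedicated master: eligible to run the cluster, holds no data.
    static Settings dedicatedMaster() {
        return ImmutableSettings.settingsBuilder()
                .put("node.master", true)
                .put("node.data", false)
                .build();
    }

    // Data node: holds shards, not master-eligible.
    static Settings dataNode() {
        return ImmutableSettings.settingsBuilder()
                .put("node.master", false)
                .put("node.data", true)
                .build();
    }

    // Client node (what the REST/Jetty JVMs would embed): neither role.
    static Settings clientNode() {
        return ImmutableSettings.settingsBuilder()
                .put("node.master", false)
                .put("node.data", false)
                .build();
    }
}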

-Zach


My suggestion is

Front end { HTTP -> (ucarp) -> nginx -> Jetty + ES TransportClient
singleton } -> Back end { ES data nodes }

With this configuration, you have a clear separation of machines into front
end and back end, each independently scalable. The front end can consist of
cheap machines. ucarp protects the front-end machines against failure, nginx
is the load balancer and handles HTTPS and auth, and Jetty can connect even
to multiple clusters on private networks (for example, to staging/production
environments). The Jetty JVM can optionally be sized for some extra ES
result processing, if result sets are expected to become very large. In
addition to my REST workflows, I have to transform JSON->XML->MODS->HTML.
I've never felt the need to set up data-less nodes.
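
A minimal sketch of that TransportClient singleton with the 0.90-era Java API
(cluster name and data-node hostnames are placeholders):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public final class EsClientHolder {
    private static final TransportClient CLIENT;

    static {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster")   // placeholder
                .build();
        // The TransportClient never joins the cluster; it connects to the
        // listed data nodes and round-robins requests across them.
        CLIENT = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("data-node-1", 9300))
                .addTransportAddress(new InetSocketTransportAddress("data-node-2", 9300));
    }

    private EsClientHolder() {}

    // Thread-safe; share one instance across all Jetty request handlers.
    public static TransportClient get() {
        return CLIENT;
    }
}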

Jörg


Perfect. Thanks Zach & Jörg for the feedback!

We'll plan to move forward with this and let you know how it goes.

all the best,
brian


Brian O'Neill
Chief Architect
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 • healthmarketscience.com


