Questions about architecture and design

I have been doing some extensive research on the work here with
Elasticsearch and am planning to use this indexing technology as the
operational datastore for our new enterprise architecture. We are
redesigning from scratch and have chosen to use an index for a near
real-time operational data store instead of an RDBMS or other index
technologies like Solr and Endeca.

In initiating this project, I was wondering if you could advise us on
a few things:

  1. What is recommended as far as the particular number of nodes to
    start out with? How many cores? How much RAM per node? Is there a
    matrix or set of guidelines for determining these things?
  2. Ignoring any monetary limitations, would several small nodes in a
    cluster be preferable to a few large nodes, or vice versa?
  3. Would the recommendation be to use the in-memory storage or the
    filesystem storage option? It seems in-memory would perform better,
    but if the rate of growth is fast enough it may cause issues long
    term (meaning more nodes to keep up with growth vs. increased file
    storage to keep up with growth).
  4. If we determine we would like assistance in a more personal way,
    what is the recommended way to go about finding someone with
    Elasticsearch experience who would be willing to help out?

Since we are starting from scratch, it's hard to determine the exact
details at this point for the data that will be indexed. But we are
basing our other stack components on the expectation that we will be
indexing hundreds of millions of records at a steady growth of
10,000/day, and will need to anticipate 1,000 query requests/sec and
50 insert/update requests/sec.

If more info is needed, please let me know, and if there are any
concerns about using this technology this way, feel free to voice
those as well.

On Tue, Jan 24, 2012 at 11:58 AM, Wes Plunk wes@wesandemily.com wrote:

I have been doing some extensive research on the work here with
Elasticsearch and am planning to use this indexing technology as the
operational datastore for our new enterprise architecture. We are
redesigning from scratch and have chosen to use an index for a near
real-time operational data store instead of an RDBMS or other index
technologies like Solr and Endeca.

My company did something similar and we settled on Elasticsearch.

In initiating this project, I was wondering if you could advise us on
a few things:

  1. What is recommended as far as the particular number of nodes to
    start out with? How many cores? How much RAM per node? Is there a
    matrix or set of guidelines for determining these things?

As with almost any application, the more cores and RAM, the better. A
lot of it depends on budget, what your data looks like, and your
requirements. You'll also need some fast disks, especially for fast
writes.

  2. Ignoring any monetary limitations, would several small nodes in a
    cluster be preferable to a few large nodes, or vice versa?

That also depends on your data and requirements.

  3. Would the recommendation be to use the in-memory storage or the
    filesystem storage option? It seems in-memory would perform better,
    but if the rate of growth is fast enough it may cause issues long
    term (meaning more nodes to keep up with growth vs. increased file
    storage to keep up with growth).

Fast disks will help you. If your data is very large, then it's going
to be tough to keep it all in memory. If you can keep everything in
RAM, then you'll have an amazingly fast system.

  4. If we determine we would like assistance in a more personal way,
    what is the recommended way to go about finding someone with
    Elasticsearch experience who would be willing to help out?

Since we are starting from scratch, it's hard to determine the exact
details at this point for the data that will be indexed. But we are
basing our other stack components on the expectation that we will be
indexing hundreds of millions of records at a steady growth of
10,000/day, and will need to anticipate 1,000 query requests/sec and
50 insert/update requests/sec.

10,000 inserts/day is not much. We've been able to index 5,000+
records/sec on a 4-node cluster on AWS using "large" nodes, and that
was roughly the average rate over 100m+ records. 50 inserts/sec is
nothing, unless the records are multiple megabytes each.
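The arithmetic behind that claim can be sketched in a few lines (the
rates are the ones quoted in this thread; 86,400 is just the number of
seconds in a day):

```python
# Back-of-envelope capacity arithmetic using the figures from this thread.

measured_bulk_rate = 5_000   # records/sec indexed on a 4-node AWS cluster
daily_growth = 10_000        # new records/day from the original question
steady_insert_rate = 50      # insert/update requests/sec to anticipate

# 10,000 records/day is a trivial average ingest rate:
avg_inserts_per_sec = daily_growth / 86_400
print(f"average ingest: {avg_inserts_per_sec:.2f} records/sec")  # ~0.12/sec

# Even the anticipated 50/sec is a small fraction of the measured rate:
headroom = measured_bulk_rate / steady_insert_rate
print(f"headroom over anticipated writes: {headroom:.0f}x")  # 100x
```

In other words, the write side of this workload is two or three orders
of magnitude below what even a small cluster has been measured doing.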

1,000 requests/sec should be very doable. You'll need to do some
benchmarking to figure out how to size and manage your nodes.
Generally, adding more replicas increases read scalability.

If more info is needed please let me know or if there are any concerns
in using this technology this way feel free to voice those as well

So far, ES has been excellent, especially in terms of speed and
scalability. We currently have 170m+ records indexed on 4 relatively
small nodes and are completing single searches in under 1 second. We
anticipate having about 1b records indexed over the next few months,
and we're looking at 20+ nodes at that point. The great thing is, when
we need more capacity, we simply spin up some new nodes and add them
to the cluster. ES takes care of almost everything else.
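For reference, a minimal sketch of what joining a node looked like in
the 0.x era (the names below are made-up examples; with multicast
discovery enabled by default at the time, a node sharing the same
cluster name would find and join the cluster automatically on
startup):

```yaml
# elasticsearch.yml on the new node (hypothetical names; 0.x-era
# defaults assumed, where multicast discovery was on by default)
cluster.name: my-production-cluster   # must match the existing cluster
node.name: node-5                     # optional, for identification only
```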

--

CRAIG BROWN
chief architect
youwho, Inc.

www.youwho.com

T: 801.855.0921
M: 801.913.0939