Help: Is Elasticsearch the right tool for us?

We are currently evaluating alternatives for two of our use cases because
we are slowly hitting the limits regarding performance. Elasticsearch looks
like a pretty good candidate for us! However, I wondered if someone
in this group could tell me whether it really is the right choice for
us.

Our first use case currently manages 40 Million+ Documents (Maps) that
could easily be structured as JSON documents with an average size of 10-20
KB per document. Documents are identified by a unique id prefixed with some
kind of a "partition" identifier, a la -. Those
logical partitions are not balanced and contain from 50.000-100.000 up to a
few million documents. Partitions typically grow slowly, but in large
batches: when a partition grows, thousands of documents are added at
once. Once a partition is populated, around 25% of the documents
within the partition are updated around 3 times a day. Each partition
must be read in batches of around 40.000-50.000 documents around 3 times a
day. Documents are fetched on an id basis, so we have to hit the database
with a list of a few thousand ids. Currently our ids are evenly spread
within a partition (due to the usage of a UUID). We plan to change this,
however, so that data that is often read together has ids that are closely
related to each other (in an alphabetical sense). We are currently using a
combination of MySQL and Lucene with a pretty trivial MySQL schema -
basically a primary key and a blob where the documents are stored. We are
then indexing documents with Lucene. The Lucene Index is queried by the
application for document ids that are then fetched from the database. For
indexing we use many of the Lucene gems in order to provide rich query
possibilities, so we need full power for manual indexing configuration (via
code extensions ?) and query building / parsing. The bad thing is that one
of our requirements is to search freshly stored or updated documents
immediately - but we have some (!) time, since there's no user sitting on the
other side staring at the screen :o) We currently index right after
storing, which is typically again done in batches of around 40.000-50.000
documents, and indexing currently takes a few seconds.

Our second use case is time-range-aware aggregation over many hundreds of
millions of rows, e.g. how many clicks did we have in the last 31 days,
returned as a series of data grouped by day. Data is again structured in
partitions, where in this case a partition is a combination of numeric
values (a composite primary key). We are currently using denormalized MySQL
tables with some strategic indices to support typical where and group by
clauses. We have to insert/update up to a few million rows per hour, where
98% of all incoming rows are updates and 2% are new rows. Queries must
have very low latency, and along with aggregation queries we will have
many documents concurrently accessed by primary key.

Thanks in advance for your advice!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4f7a6daf-0949-437c-a70c-50a5d65f8dcb%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Peter,

It sounds like Elasticsearch is perfect for both of your use cases. The
trick is getting a server cluster (or maybe one for each use case) set up
appropriately that will meet your needs. Elasticsearch scales seamlessly
both horizontally and vertically, and will take full advantage of whatever
hardware you give it. Figuring out how many servers you need and how
powerful they need to be will take some time and experimentation, but I
have no doubt that Elasticsearch can handle your data needs.

Since you are already using Lucene, you will be familiar with many of the
capabilities of Elasticsearch already. Searching for newly indexed data is
not a problem since the refresh interval is configurable (the default is one
second), but refreshing more often costs indexing throughput. In many cases
Elasticsearch can even function as the primary data store, eliminating the
need to store your data in two places.
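To make the batch workflow a bit more concrete, here is a rough sketch of how the request bodies for a bulk write and a fetch-by-id-list could be built. It only constructs the payloads for the `_bulk` and `_mget` endpoints and sends nothing over the network. The index name `maps`, the ids, the document fields, and the chunk size are all made up for illustration, and depending on the Elasticsearch version the bulk action line may also need a `_type`.

```python
import json


def bulk_index_body(index, docs):
    """Build the newline-delimited body for a _bulk request.

    `docs` maps document id -> document (a dict). Each document becomes
    two lines: an action line followed by the document source.
    """
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


def mget_bodies(ids, chunk_size=1000):
    """Split a long id list into several _mget request bodies."""
    return [{"ids": ids[i:i + chunk_size]}
            for i in range(0, len(ids), chunk_size)]


# Hypothetical sample data: two documents in a partition called "p1".
docs = {
    "p1-0001": {"title": "first map"},
    "p1-0002": {"title": "second map"},
}
body = bulk_index_body("maps", docs)

# 2500 hypothetical ids, fetched in chunks of at most 1000.
chunks = mget_bodies(["p1-%04d" % n for n in range(2500)])
```

Since nobody is waiting at a screen, one option would be to leave the refresh interval relaxed while a batch is written and call `POST /maps/_refresh` once when the batch is done, so the new documents become searchable at that point.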

Have you seen the new aggregations feature coming in Elasticsearch 1.0,
yet? It sounds like it could help with your second use case.
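
As a rough idea of what the "clicks in the last 31 days, grouped by day" query might look like with aggregations, here is a sketch that builds the search request body. The field names `timestamp` and `clicks` are assumptions about your data, not something from your schema, and the name of the interval parameter differs between Elasticsearch versions (`interval` in the 1.x line, `calendar_interval` in recent releases), so check it against the version you end up running.

```python
import json


def clicks_per_day_request(days=31, time_field="timestamp", count_field="clicks"):
    """Build a search body that sums clicks per day over the last `days` days.

    The field names are hypothetical placeholders for the real mapping.
    """
    return {
        # Only the aggregation result is needed, not the matching documents.
        "size": 0,
        # Restrict to the time range, rounded down to the start of the day.
        "query": {"range": {time_field: {"gte": "now-%dd/d" % days}}},
        "aggs": {
            "clicks_per_day": {
                # One bucket per day ...
                "date_histogram": {"field": time_field, "interval": "day"},
                # ... with the clicks summed up inside each bucket.
                "aggs": {"total_clicks": {"sum": {"field": count_field}}},
            }
        },
    }


request = clicks_per_day_request()
print(json.dumps(request, indent=2))
```

Restricting the query to one of your partitions would mean adding filters on the columns of the composite key next to the range clause.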

My company StackSearch provides hosted Elasticsearch (on both Amazon EC2
and Rackspace, in all their respective data centers) at http://qbox.io. We
also provide consulting services for creating and managing data flow
strategies, and are an official reseller of Elasticsearch (the company)
support contracts. Please let me know if there is anything we can do to
help you.

On Thursday, December 19, 2013 1:46:09 PM UTC-6,
peter.r...@smarter-ecommerce.com wrote:
