We are currently evaluating alternatives for two of our use cases because
we are slowly hitting the ceiling performance-wise. Elasticsearch looks
like a pretty good candidate for us! However, I wondered if someone out
there in this group could tell me whether it really is the right choice
for us.
Our first use case currently manages 40 million+ documents (maps) that
could easily be structured as JSON documents, with an average size of
10-20 KB per document. Documents are identified by a unique id prefixed
with some kind of "partition" identifier, a la <partition>-<uuid>. Those
logical partitions are not balanced and contain from 50,000-100,000 up to
a few million documents. Partitions typically grow slowly, but in large
batches: when a partition grows, thousands of documents are added at once.
Once a partition is populated, around 25% of the documents within the
partition are updated around 3 times a day. Each partition must also be
read in batches of around 40,000-50,000 documents around 3 times a day.
Documents are fetched on an id basis, so we have to hit the database with
a list of a few thousand ids. Currently our ids are evenly spread within a
partition (due to the use of UUIDs). We plan to change this, however, so
that data that is often read together has ids that are close to each other
(in an alphabetical sense).
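From what I can tell, the id-based batch fetch would map to Elasticsearch's
multi-get API, with custom routing used to keep a whole partition on a
single shard. A minimal sketch - the index, type, id and routing values
here are all made up:

  # store one document, routed by its partition prefix so that the
  # whole partition lives on a single shard
  curl -XPUT 'http://localhost:9200/docs/doc/p42-00a1?routing=p42' -d '{
    "title": "some document"
  }'

  # fetch a batch of documents by id in a single round trip; routed
  # documents need the same routing value at fetch time
  curl -XPOST 'http://localhost:9200/docs/doc/_mget?routing=p42' -d '{
    "ids": ["p42-00a1", "p42-00a2", "p42-00a3"]
  }'

Whether a few thousand ids per _mget round trip is a sane batch size is
exactly the kind of thing I'd like feedback on.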
We are currently using a combination of MySQL and Lucene with a pretty
trivial MySQL schema: basically a primary key and a blob where the
documents are stored. We then index the documents with Lucene. The
application queries the Lucene index for document ids, which are then
fetched from the database. For indexing we use many of the Lucene gems in
order to provide rich query possibilities, so we need full control over
manual indexing configuration (via code extensions?) and query
building/parsing.
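If I understand the documentation correctly, the analyzer configuration we
currently do in Lucene code would move into index settings and mappings,
and the query building into the JSON query DSL. Another sketch with
invented names (the "fields": [] trick to return only ids is what I found
for the current API; newer versions may spell this differently):

  # define a custom analyzer and wire it to a field at index-creation time
  curl -XPUT 'http://localhost:9200/docs' -d '{
    "settings": {
      "analysis": {
        "analyzer": {
          "folded_text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "asciifolding"]
          }
        }
      }
    },
    "mappings": {
      "doc": {
        "properties": {
          "title": {"type": "string", "analyzer": "folded_text"}
        }
      }
    }
  }'

  # ask only for matching ids, mirroring our current "query Lucene for
  # ids, fetch the blobs elsewhere" pattern
  curl -XPOST 'http://localhost:9200/docs/doc/_search' -d '{
    "fields": [],
    "query": {"match": {"title": "rich query possibilities"}}
  }'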
The bad thing is that one of our requirements is to be able to search for
freshly stored or updated documents immediately - but we do have some (!)
time, since there's no user sitting on the other side staring at the
screen :o) We currently index right after storing, which is again
typically done in batches of around 40,000-50,000 documents, and indexing
currently takes a few seconds.
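If I read the docs right, our store-then-index batches would correspond to
the bulk API, optionally followed by an explicit refresh to make the batch
searchable immediately. Again only a sketch with invented names (older
versions spell the routing key "_routing" in the bulk metadata, newer ones
"routing"):

  # bulk-index a batch; the _bulk body is newline-delimited
  # action/source pairs
  curl -XPOST 'http://localhost:9200/docs/doc/_bulk' --data-binary '
  {"index": {"_id": "p42-00a1", "_routing": "p42"}}
  {"title": "some document"}
  {"index": {"_id": "p42-00a2", "_routing": "p42"}}
  {"title": "another document"}
  '

  # make the batch visible to search right away instead of waiting for
  # the periodic refresh (refresh_interval defaults to 1s)
  curl -XPOST 'http://localhost:9200/docs/_refresh'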
Our second use case is time-range-aware aggregations across many hundreds
of millions of rows, e.g. "how many clicks did we have in the last 31
days?", returned as a series of data points grouped by day. Data is again
structured in partitions, where in this case a partition is a combination
of numeric values (a composite primary key). We are currently using
denormalized MySQL tables with some strategic indices to support the
typical WHERE and GROUP BY clauses. We have to insert/update up to a few
million rows per hour, where 98% of all incoming rows are updates and 2%
are new rows. Queries must have very low latency, and alongside the
aggregation queries we will have many documents concurrently accessed by
primary key.
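From the docs, the daily click series looks like a date_histogram
aggregation over a filtered time range. A sketch with made-up index and
field names:

  # clicks per day over the last 31 days, as one bucket per day
  curl -XPOST 'http://localhost:9200/clicks/_search' -d '{
    "size": 0,
    "query": {
      "range": {"timestamp": {"gte": "now-31d/d"}}
    },
    "aggs": {
      "per_day": {
        "date_histogram": {"field": "timestamp", "interval": "day"},
        "aggs": {"clicks": {"sum": {"field": "count"}}}
      }
    }
  }'

Whether this stays low-latency while we are updating a few million rows
per hour is the part I'm most unsure about.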
Thanks in advance for your advice!