Hello all, I was curious about your input on a potential use case for
Elasticsearch. We collect massive amounts of netflow data (essentially a
description of a TCP conversation that occurred between two endpoints
through a router/switch/etc.), which we save and analyze using basic
netflow tools. Our current software solution is deployed on a single
machine, can only store very limited quantities of netflow records, and
offers only basic functions for reading that data. We would like to use a
scalable database of some kind to store the information, giving us the
ability to run interesting queries over millions or even billions of
records. We have experimented with Cassandra in this regard; however,
Cassandra really does not allow one to ask questions about the data, as it
is not capable of any aggregation other than a simple row count (we did
create multiple data models that could answer a number of our queries, but
being able to do ad hoc aggregation is much more appealing).
Our use case involves storing these netflow records, which consist of a
number of fields such as source IP address, destination IP address, source
port, destination port, total bytes in the conversation, and a few others.
We would like to be able to run interesting queries over the data, such as
"How many destination IP addresses did the source IP address w.x.y.z
connect to between 4am and 7am on Tuesday last week?", or "How many bytes
of data were sent to destination port 443 during the month of May?".
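For concreteness, the two questions above might look roughly like this in
the aggregation DSL, sketched with the official Python client. The index
name, field names, and timestamps are placeholders for whatever the real
mapping would define, and this assumes an ES version with the aggregations
framework:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# "How many distinct destination IPs did w.x.y.z connect to between
# 4am and 7am last Tuesday?" -- a filtered cardinality aggregation.
# Note that cardinality counts are approximate (HyperLogLog-based).
resp = es.search(index="netflow-*", body={
    "size": 0,
    "query": {"bool": {"filter": [
        {"term": {"src_ip": "w.x.y.z"}},
        {"range": {"@timestamp": {"gte": "2013-07-23T04:00:00",
                                  "lt": "2013-07-23T07:00:00"}}},
    ]}},
    "aggs": {"distinct_dst_ips": {"cardinality": {"field": "dst_ip"}}},
})
print(resp["aggregations"]["distinct_dst_ips"]["value"])

# "How many bytes went to destination port 443 during May?" -- a sum.
resp = es.search(index="netflow-*", body={
    "size": 0,
    "query": {"bool": {"filter": [
        {"term": {"dst_port": 443}},
        {"range": {"@timestamp": {"gte": "2013-05-01", "lt": "2013-06-01"}}},
    ]}},
    "aggs": {"bytes_to_443": {"sum": {"field": "bytes"}}},
})
print(resp["aggregations"]["bytes_to_443"]["value"])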
The biggest consideration is the size of the dataset, which we estimate
would grow by tens of millions of records a day. Considering this, would
Elasticsearch be capable of handling such a large volume of data, and could
it search that data efficiently?
Ten million records in one hour? Yes. Or did you mean per day? The answer
is still yes.
From my perspective, ES can manage a dataset of that size; your client just
needs to be a little bit smart.
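By "smart" I mean mainly two things: batch your writes through the bulk
API, and use time-based indices so old data can be expired cheaply. A rough
sketch with the Python client (the index naming scheme and record layout
here are made up for illustration):

from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])

flows = [  # stand-in for the real collector output
    {"@timestamp": datetime(2013, 7, 30, 4, 12), "src_ip": "10.0.0.1",
     "dst_ip": "10.0.0.2", "dst_port": 443, "bytes": 5120},
]

def actions(records):
    # One index per day ("netflow-YYYY.MM.DD") keeps each index a
    # manageable size and makes dropping old data a cheap index delete
    # instead of millions of document deletes.
    for rec in records:
        yield {"_index": "netflow-" + rec["@timestamp"].strftime("%Y.%m.%d"),
               "_source": rec}

# The bulk helper batches documents into _bulk requests; tens of
# millions of single-document HTTP calls a day would drown in
# per-request overhead.
bulk(es, actions(flows))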
Not trying to shoot down the idea (or maybe I'll hear some interesting
counter-opinions!), but at the moment there isn't a clear, well-tested, and
reliable way to snapshot and back up an ES cluster's data, especially on a
large, active cluster (one where you simply can't stop sending updates /
queries).
AFAICT, ES is not to be considered your source of data. (Yes, to us this
means keeping the source of data somewhere 'stable', such as a SQL DB, and
having some trusted, tested process to rebuild the index.)
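Concretely, the rebuild process I have in mind might look like this sketch,
with sqlite and a hypothetical flows table standing in for whatever the
stable store really is:

import sqlite3
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# The SQL database is the system of record; the ES index is disposable
# and can be regenerated from it at any time.
es = Elasticsearch(["http://localhost:9200"])
db = sqlite3.connect("netflow.db")

def rows_as_actions():
    cur = db.execute("SELECT ts, src_ip, dst_ip, dst_port, bytes FROM flows")
    for ts, src_ip, dst_ip, dst_port, nbytes in cur:
        yield {"_index": "netflow-rebuild",
               "_source": {"@timestamp": ts, "src_ip": src_ip,
                           "dst_ip": dst_ip, "dst_port": dst_port,
                           "bytes": nbytes}}

bulk(es, rows_as_actions())
# Once the rebuild finishes, flip an index alias from the old index to
# "netflow-rebuild" so readers never see a gap.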
Again, maybe (hopefully) I'm wrong and I have missed some announcement wrt
support for native snapshots / log streaming à la RDBMS replication to a
separate cluster for backups / snapshots / geographical
federation/distribution.
Thanks for the insight. I am very new to Elasticsearch, and I had wondered
whether this system billed itself as a true data warehouse or simply an
ephemeral search index. It seems it could take on a new life as a database,
considering its capabilities when working with such large data. We are now
looking at the DataStax Enterprise Solr integration, which adds search on
top of Cassandra, and possibly at using Cassandra and Elasticsearch in
conjunction.
On Wednesday, July 31, 2013 4:21:15 AM UTC-5, Norberto Meijome wrote:
Again, maybe (hopefully) I'm wrong and I have missed some announcement wrt
support for native snapshots / log streaming à la RDBMS replication to a
separate cluster for backups / snapshots / geographical
federation/distribution.
Thanks for clarifying. I did go back and read more of the documentation and
have become familiar with replicas and persistence. Can you speak to the
use case I illustrated above in any fashion? We are very curious to hear
about operational experiences scaling Elasticsearch, and about the
performance of writing to and reading from massive data sets.
Aggregations are a notable feature, but strictly speaking, I do not think
they are really map-reduce. Or are they?
I might be wrong, but AFAICT there is no shuffle phase (it would be
expensive in real time), so instead of doing aggregations for all key
values in a single reducer, a lot of partial segment- (or node-) level
aggregations are probably performed and then all of those results are
combined (so this IMO relies on the fact that you can get the final result
by addition).
But as I said, I would be happy to be proven wrong.
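To illustrate the combine-by-addition shape with a toy example (nothing
ES-specific, just the algebra that makes shard-local partial results
mergeable without a shuffle):

# Each "shard" computes a partial (sum, count) over its own documents;
# the coordinating node merges the partials. This works exactly because
# sum and count are additive.
shard_docs = [
    [120, 480, 64],       # byte counts held by shard 0
    [2048, 512],          # shard 1
    [96, 96, 4096, 10],   # shard 2
]

partials = [(sum(d), len(d)) for d in shard_docs]   # per-shard "map"
total = sum(s for s, _ in partials)                 # merge on the
count = sum(c for _, c in partials)                 # coordinating node
print(total, count, total / count)                  # even the mean falls out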
Aggregations are one piece of certain map-reduce-like algorithms. They
could be accompanied by a modified bulk indexing action that presorts data
so documents arrive in place at the shards, where the aggregation framework
can process them effectively.
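Custom shard routing is arguably the closest thing ES already has to that
idea. A minimal sketch (index and field names hypothetical) that routes
each flow by its source IP, so all documents for one source land on the
same shard and per-source work stays shard-local:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])

docs = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "bytes": 512},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "bytes": 64},
]

# "_routing" overrides the default hash-of-id shard placement, acting
# as the presorting step described above.
bulk(es, ({"_index": "netflow-routed",
           "_routing": doc["src_ip"],
           "_source": doc} for doc in docs))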
I agree that ES will never be Hadoop, but I see no reason why the ES
distributed architecture should not be extensible to run some sort of
simple map-reduce-style analysis on indexed documents, for example,
creating ordered lists of all the values of all fields for statistics. In
bibliographic data, for instance, librarians tend to ask how many
occurrences of values appear in which fields, and which fields are less
used.
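Something like that field-usage question can already be approximated with
exists queries; a sketch (the field list is hypothetical) that counts, for
each field, how many documents actually populate it:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# "Which fields are less used?" -- one count per candidate field.
for field in ["src_ip", "dst_ip", "dst_port", "bytes", "tcp_flags"]:
    n = es.count(index="netflow-*",
                 body={"query": {"exists": {"field": field}}})["count"]
    print("%-10s %d" % (field, n))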