Faceted Statistics - How to create drill-down data or best practices


(Mike Kelp) #1

Hey All,

Disclaimer: This is my first post in this group and I can get a little
wordy... but I love this stuff...

I've been playing with Elastic Search for a semi-real-time network analysis
and alert system, and I'm excited to say I'm making real progress on it.
After exploring MongoDB, CouchDB, TerraStore, Riak, and then Elastic Search
(in that order), I believe Elastic Search is the best candidate. The problem
was finding the right combination of rich indexing and a dumb document data
store: among the other stores I found map/reduce quite slow for our purposes
and the indexing very limited. So far I've been very impressed with the
performance, the API, and the cleanliness of the system as a whole.

Anyway, as the subject states, I'm interested in creating a model similar
to faceted search, but instead of searching documents directly with counts
as facets, the statistics are my facets. As a simple example, consider
bandwidth in and out as gathered from a web server log or something along
those lines. With all my web logs indexed in one default index, I can get a
total of all of that data with the statistical facet beautifully and
quickly (I haven't gone too deep into optimizing indices yet, e.g. when it
is best to separate them). The problem comes when I want to see that data
by server (hostIp), or client, or hostname (domain), progressively drilling
down to determine where the bulk of the bandwidth is and what is causing
it. You could apply this to many statistics, but I think this simple
example shows the issue. In these cases, counts aren't the VALUE of my
search; SUMS and other mathematical aggregations are. They define how the
user wishes to drill down, whether to investigate abnormally high or low
bandwidth and determine the subsets in which the problem occurs.
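
For reference, here is roughly what one drill-down step looks like for me
today: rerun the same statistical facets with a term filter applied. This is
just a minimal sketch against the facet API; the index name (weblogs) and
field names (bytes_in, bytes_out, host_ip) are placeholders I made up for
illustration.

```python
import json

import requests

# Placeholder index name and endpoint for illustration only.
ES_URL = "http://localhost:9200/weblogs/_search"

def bandwidth_stats(term_filter=None):
    """Run statistical facets over bytes_in/bytes_out, optionally
    narrowed by a term filter (one drill-down step)."""
    query = {"match_all": {}}
    if term_filter:
        # Drill down: the same facets, computed only over matching docs.
        query = {"filtered": {"query": query, "filter": {"term": term_filter}}}
    body = {
        "query": query,
        "facets": {
            "bytes_in": {"statistical": {"field": "bytes_in"}},
            "bytes_out": {"statistical": {"field": "bytes_out"}},
        },
        "size": 0,  # facet results only, no hits
    }
    resp = requests.post(ES_URL, data=json.dumps(body))
    return resp.json()["facets"]

print(bandwidth_stats())                         # totals over everything
print(bandwidth_stats({"host_ip": "10.0.0.1"}))  # drill down to one server
```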

Question: How would I best achieve this in Elastic Search, or how would
you suggest I go about adding it to the project while respecting the API
and the purpose of Elastic Search as a whole?

If this is something I can do effectively, I would love to submit any
resulting code to the project as well. I simply don't want to go mucking
about in the code with no appreciation of its complexity or architecture
when there is likely an ideal place to add this feature that I would have
to find anyway. I have already begun looking at some of the code and am
considering implementing a grouped facet with common operations, or
something of the sort, since facets seem like a good spot for aggregations
and, really, grouping is effectively a simple map/reduce-style operation
(see the sketch below).
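
To make that concrete, here is how the grouping can be emulated client-side
with the existing API: a terms facet to find the groups, then one filtered
statistical facet per term. What I'm proposing would do this server-side in
a single pass. Again a sketch only; weblogs, host_ip, and bytes_out are
placeholder names.

```python
import json

import requests

ES_URL = "http://localhost:9200/weblogs/_search"  # placeholder index

def stats_by_term(group_field, stat_field, size=10):
    """Emulate a 'grouped' statistical facet client-side: fetch the
    top terms for group_field, then run one filtered statistical
    facet per term."""
    # Step 1: the groups, via a terms facet.
    body = {
        "query": {"match_all": {}},
        "facets": {"groups": {"terms": {"field": group_field, "size": size}}},
        "size": 0,
    }
    resp = requests.post(ES_URL, data=json.dumps(body)).json()
    terms = [entry["term"] for entry in resp["facets"]["groups"]["terms"]]

    # Step 2: one statistical facet per group -- a round trip per
    # term, which is exactly what a grouped facet would eliminate.
    results = {}
    for term in terms:
        body = {
            "query": {
                "filtered": {
                    "query": {"match_all": {}},
                    "filter": {"term": {group_field: term}},
                }
            },
            "facets": {"stats": {"statistical": {"field": stat_field}}},
            "size": 0,
        }
        resp = requests.post(ES_URL, data=json.dumps(body)).json()
        results[term] = resp["facets"]["stats"]
    return results

# e.g. per-server bandwidth statistics (field names are placeholders):
print(stats_by_term("host_ip", "bytes_out"))
```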

In the end, I imagine a system in which this form of drill-down can occur
while we manage massive amounts of network data over a rolling time period
(say 60 days), with real-time searching of everything in that window. The
system would also maintain near-real-time aggregate reports that roll up
data as it ages out of the window, preserving performance and tracking the
most important information over all time without growing storage just for
the sake of storing everything.
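
One pattern I'm considering for the rolling window, sketched below: one
index per day behind a search alias, with the aggregate report computed
over each index as it expires. The logs-* names, the alias, and the 60-day
window are placeholder assumptions, not anything Elastic Search prescribes.

```python
import datetime as dt
import json

import requests

ES = "http://localhost:9200"   # placeholder cluster address
WINDOW_DAYS = 60               # the rolling retention period

def index_for(day: dt.date) -> str:
    # One index per day, e.g. logs-2010-08-15 (naming is illustrative).
    return f"logs-{day:%Y-%m-%d}"

def roll_window(today: dt.date) -> None:
    new_index = index_for(today)
    expired = index_for(today - dt.timedelta(days=WINDOW_DAYS))

    # Create today's index and put it behind a 'logs' search alias,
    # so queries always see the whole 60-day window under one name.
    requests.put(f"{ES}/{new_index}")
    actions = {"actions": [{"add": {"index": new_index, "alias": "logs"}}]}
    requests.post(f"{ES}/_aliases", data=json.dumps(actions))

    # This is where the near-real-time aggregate report over the
    # expiring index would be computed and stored before deletion.
    requests.delete(f"{ES}/{expired}")  # deleting an index drops its alias

roll_window(dt.date.today())
```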

Question: As a broader question, am I playing well into the use cases for
Elastic Search, or is there another strategy you would recommend?

Lastly, Shay, thank you so much for this awesome project and your long-term
passion for search. It is clear from all of the research and experiences
you have shared with the community that you have a unique perspective and
the ability to act on it. I look forward to seeing you rewarded for your
efforts on this project (and to assisting where possible) as we all build
upon and apply your research and contributions. Cheers.

Mike.

