Uniquifying results on a field before applying histogram facet?

Some background, so hopefully the question makes sense.

On the OpenStack project, we're building some tooling around the logstash /
Elasticsearch back end that we use for all our CI results. We generate
about 25k cloud test runs a week, and each one produces a couple dozen log
files (one for each of the services), which we then push into logstash for
indexing. With a system as complex as OpenStack, being an async messaging
architecture, we'll see tests fail "randomly", which is really just the
exposure of an underlying race of some kind. Given that we do those 25k
clouds a week, a race which shows up 0.1% of the time (on average) in our
environment will give us about 25 unique events a week, which starts to be
something you can go after. We've started building a rudimentary system
which uses some hand-curated Elasticsearch queries that fingerprint
particular bugs that we've seen before. It helps us figure out whether we
are looking at new races or known races, and prioritize bugs based on how
often we're seeing the race exposed in our environment.

So, now the question:

It turned out to be really useful to build graphs to visualize bugs coming
and going during the end of the last release. We built the histograms
manually. For low-frequency events that was fine, but it would be really
great to actually use the histogram faceting facilities of Elasticsearch,
as that pushes all the logic back to one place (and realistically is much
faster).

The issue with this is that our fingerprints aren't always unique within a
build. Some fingerprints show up only once per build, some show up multiple
times. That means if we only facet on the histogram, we are counting
matching log lines rather than builds, so we can't really compare the
fingerprints to overall runs to get statistical comparisons. Given that
logstash creates a new document for each logical log line, our best bet
would be if we could uniquify our results on @fields.build_uuid.

Is there any way to do this? Is there some other approach we could use to
get similar results out of Elasticsearch directly? Or are we going to need
to drag all the results over the wire and process them on our side in Python?
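
For reference, the client-side fallback would look roughly like the sketch below. It is only an illustration: the helper name, the hourly bucketing, and the sample hits are made up, and it assumes the hits have the @timestamp and @fields.build_uuid layout of the record shown further down.

```python
from collections import Counter
from datetime import datetime


def unique_builds_per_bucket(hits, fmt="%Y-%m-%dT%H"):
    """Count each build_uuid once, in the bucket of its earliest matching line.

    `hits` is assumed to be the list under hits.hits in a search response.
    `fmt` is a strftime pattern that defines the bucket (hourly by default).
    """
    first_seen = {}
    for hit in hits:
        src = hit["_source"]
        uuid = src["@fields"]["build_uuid"]
        ts = datetime.strptime(src["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        # Keep only the earliest timestamp seen for each build.
        if uuid not in first_seen or ts < first_seen[uuid]:
            first_seen[uuid] = ts
    return Counter(ts.strftime(fmt) for ts in first_seen.values())


# Hypothetical sample: build "aaa" matches twice, build "bbb" once.
hits = [
    {"_source": {"@timestamp": "2013-10-14T23:23:55.214Z",
                 "@fields": {"build_uuid": "aaa"}}},
    {"_source": {"@timestamp": "2013-10-14T23:24:10.000Z",
                 "@fields": {"build_uuid": "aaa"}}},
    {"_source": {"@timestamp": "2013-10-15T01:02:03.000Z",
                 "@fields": {"build_uuid": "bbb"}}},
]
counts = unique_builds_per_bucket(hits)
```

This works, but it means pulling every matching document back to the client, which is exactly what we would like to avoid.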

This is our basic JSON request for the histogram faceting (raw_query and
unit are variables we fill in):

{
    "sort": {
        "@timestamp": {"order": "desc"}
    },
    "query": {
        "query_string": {
            "query": raw_query
        }
    },
    "facets": {
        "histo": {
            "date_histogram": {
                "field": "@timestamp",
                "interval": unit
            }
        }
    }
}
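
In Python, the same request body can be produced by a small helper. This is only a sketch: the function name build_histogram_request and the example query string are hypothetical, and raw_query and unit are the same placeholders as in the request above.

```python
import json


def build_histogram_request(raw_query, unit):
    """Return the date_histogram facet request as a plain dict.

    raw_query: the fingerprint query string.
    unit: the histogram interval, e.g. "hour" or "day".
    """
    return {
        "sort": {"@timestamp": {"order": "desc"}},
        "query": {"query_string": {"query": raw_query}},
        "facets": {
            "histo": {
                "date_histogram": {
                    "field": "@timestamp",
                    "interval": unit,
                }
            }
        },
    }


# Hypothetical fingerprint query, serialized to the JSON sent over the wire.
body = build_histogram_request('@message:"Time Limit Exceeded"', "hour")
payload = json.dumps(body)
```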

This is what a logstash record that we're getting back looks like (to
understand the metadata we've constructed).

    "_source": {
        "@tags": [
            "console.html"
        ],
        "@fields": {
            "build_status": "FAILURE",
            "build_patchset": "2",
            "build_ref": "refs/zuul/master/Zc3927ff483b64f098e22a6e373ddb6de",
            "log_url": "http://logs.openstack.org/84/51584/2/check/check-tempest-devstack-vm-postgres-full/158336b/console.html",
            "project": "openstack/python-ceilometerclient",
            "build_change": "51584",
            "filename": "console.html",
            "build_name": "check-tempest-devstack-vm-postgres-full",
            "build_uuid": "158336b74911487ca6b8874589b3e321",
            "received_at": [
                "2013-10-14T23:25:23.957Z"
            ],
            "build_queue": "check"
        },
        "@timestamp": "2013-10-14T23:23:55.214Z",
        "@source_path": "/",
        "@source": "tcp://127.0.0.1:52966/",
        "@source_host": "127.0.0.1",
        "@message": "Details: Time Limit Exceeded! (400s)while waiting for active, but we got killed.",
        "@type": "jenkins"
    },

Any thoughts or comments would be appreciated.

-Sean
