Highlight events with only one value for a field

Hi there!

I'm trying to visualize in a data table only the documents which have never had a certain value for a specific field.

Let's say I have fields USERNAME and OUTCOME. I'd like to find all those docs which have only a specific outcome.
To make a more concrete example, given the following image:

image

I'd like to visualize only the user Derek, since it's the only one without any successful event.
Is there a way to do so?

Thank you!

Hi Fabio.
How many total users are we talking about here? This can be a problem at scale if you have users' related data spread across machines. You may want to look at creating an entity-centric index rather than an event-centric index to do any behavioural analysis at scale.

Hi Mark!

Thank you for your answer.

Though, for now, I'd be more than happy to know whether there is an actual solution to my problem, even not considering scalability.

Is there a way to show only those users who have always failed a given action?
Obviously filtering by "given_output: Fail" wouldn't be enough because it would simply remove the 'successful' lines (in the above image the second and fifth lines would be returned).

Thank you for your precious answer but at the moment I'm trying to understand if there is a feasible solution, even if not perfect.

The approaches involve grouping docs for all users using a ‘terms’ aggregation and trimming them as a final step to just the ones that have the one value. This can be done by a ‘scripted’ aggregation or perhaps a bucket filter pipeline aggregation but I don’t generally advise these approaches because they don’t scale with user volumes.

Thank you for your answer!

I know the above-mentioned approach might not be the best in terms of scalability but I only need it on 24-hour users data.

The problem is I might even find a way to get those data through a query+aggregation in Dev Tools but then it would return me some hits (which I do not need) and some aggs (which I do need).

Though, when I put the same query+aggs in any visualization (a table one for example) it shows me the result of hits, not the one of aggs.

Can you please make me a concrete example on how to proceed?
Thank you

I mean, inserting such docs in a testing index:

PUT aggs_test/_bulk?refresh
{"index":{"_id":1,"_type":"doc"}}
{"first":"Alice","outcome":"success"}
{"index":{"_id":2,"_type":"doc"}}
{"first":"Alice","outcome":"success"}
{"index":{"_id":5,"_type":"doc"}}
{"first":"Alice","outcome":"fail"}
{"index":{"_id":6,"_type":"doc"}}
{"first":"Alice","outcome":"fail"}
{"index":{"_id":7,"_type":"doc"}}
{"first":"Alice","outcome":"fail"}
{"index":{"_id":8,"_type":"doc"}}
{"first":"Alice","outcome":"success"}
{"index":{"_id":9,"_type":"doc"}}
{"first":"Alice","outcome":"success"}
{"index":{"_id":10,"_type":"doc"}}
{"first":"Bob","outcome":"success"}
{"index":{"_id":11,"_type":"doc"}}
{"first":"Bob","outcome":"success"}
{"index":{"_id":12,"_type":"doc"}}
{"first":"Bob","outcome":"fail"}
{"index":{"_id":13,"_type":"doc"}}
{"first":"Bob","outcome":"fail"}
{"index":{"_id":14,"_type":"doc"}}
{"first":"Charlie","outcome":"success"}
{"index":{"_id":15,"_type":"doc"}}
{"first":"Charlie","outcome":"fail"}
{"index":{"_id":16,"_type":"doc"}}
{"first":"Charlie","outcome":"fail"}
{"index":{"_id":17,"_type":"doc"}}
{"first":"Derek","outcome":"fail"}
{"index":{"_id":18,"_type":"doc"}}
{"first":"Derek","outcome":"fail"}
{"index":{"_id":19,"_type":"doc"}}
{"first":"Derek","outcome":"fail"}

I have something like this in the visualization:

Is there a simple way to filter out all the lines except the one with Derek (only one which has only fail events)?

I could show you some example of scripted aggs or bucket filter pipeline aggs but it would not be useful if you are only using Kibana. Kibana does not support these more complex query functions.
Aside from scalability, that's the other advantage of an entity-centric index - it creates pre-calculated attributes that make it simple to consume in Kibana.

So, for such a case, how would you proceed exactly?

Anyway, if you could enlighten me with a fast aggregation to extrapolate such data I'd be extremely grateful to you!

Thank you for all your time and answers.

I'd adapt the example indexing scripts in my original link to create an entity centric index from my events index.

It involves a map/reduce style query script so is essentially a programming exercise - see Scripted metric aggregation | Elasticsearch Guide [8.11] | Elastic

I'll do this indeed as soon as possible.

This is very useful but it doesn't take into consideration possible aggregations.

I mean, running the following query on the Dev Tools,

GET aggs_test/_search
{
  "size": 0,
  "aggs": {
    "username": {
      "terms": {
        "field": "first.keyword"
      },
      "aggs": {
        "outcomes": {
          "terms": {
            "field": "outcome.keyword"
          }
        }
      }
    }
  }
}

I got something like this:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 19,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "username" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Alice",
          "doc_count" : 9,
          "outcomes" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "success",
                "doc_count" : 6
              },
              {
                "key" : "fail",
                "doc_count" : 3
              }
            ]
          }
        },
        {
          "key" : "Bob",
          "doc_count" : 4,
          "outcomes" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "fail",
                "doc_count" : 2
              },
              {
                "key" : "success",
                "doc_count" : 2
              }
            ]
          }
        },
        {
          "key" : "Charlie",
          "doc_count" : 3,
          "outcomes" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "fail",
                "doc_count" : 2
              },
              {
                "key" : "success",
                "doc_count" : 1
              }
            ]
          }
        },
        {
          "key" : "Derek",
          "doc_count" : 3,
          "outcomes" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "fail",
                "doc_count" : 3
              }
            ]
          }
        }
      ]
    }
  }
}

What should I add (bucket_script?) to the existing query to have only the Derek bucket returned?
Which is the way to make a comparison like "return only those buckets whose doc_count == key_fail>doc_count?"

I know I'm asking a lot but you're really helping!

It would require a bucket selector to trim the results

Ok but given the syntax of the bucket selector

{
    "bucket_selector": {
        "buckets_path": {
            "my_var1": "the_sum", 
            "my_var2": "the_value_count"
        },
        "script": "params.my_var1 > params.my_var2"
    }
}

And the structure of my aggs

...
"aggregations" : {
    "username" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Alice",
          "doc_count" : 9,
          "outcomes" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "success",
                "doc_count" : 6
              },
              {
                "key" : "fail",
                "doc_count" : 3
              }
            ]
          }
        },
        ...

I cannot find the right syntax to store in my_var1 the external doc_count (here 9) and in my_var2 the fail doc_count (here 3) in order to write a script like

"script": "params.my_var1 == params.my_var2"

Ideas?

I'm unclear myself - I know there are limits to what aggregated values you can address in bucket selectors.

The scripted-metric aggregation I linked to allows you complete control over what doc values you access, how you organize them (into maps etc), how you fuse them and how you trim them. You just have to write the painless code to do all the work.

Now let's compare that with what an entity-centric update script would look like:

POST test/_doc/my_user/_update
{
  "scripted_upsert":true,
  "upsert": {}, 
  "script": {
	"lang": "painless",
	"source": """
	def doc=ctx._source;
	for (event in params.newEvents){
	  if (event["outcome"]=="success") {
		doc["succeeded"]=true
	  }
	}
	""",
	"params": {
	  "newEvents":[
		{ "outcome":"fail"},
		{ "outcome":"success"}
		]
	}
  }
}

You just pass the latest batch of events as params to an update script (see my generic Python script for how to bundle events into entity update calls). The above update script sets a "succeeded" attribute on the entity if any event is of the type success. Other attributes like "lastOutcome" or "lastActionDate" are trivial to maintain here too. You can then use these attributes in Kibana.

Thank you for your kind reply.

For now, I solved it with a plugin but what you explained here is very interesting and I'll surely look at it and watch your video asap.

Thank you again for the kindness and the info!

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.