How to (pre)filter data used in a visualization?

Hello,

I'm trying to build a histogram that counts only the first occurrence (chronologically speaking) of each recorded event in the corresponding period. In my data, a specific event can occur several times, with a different outcome each time. From what I have gathered so far, this might be done by using aggregations.

However, I'm having trouble finding examples in Kibana of how to use an aggregation as an input to filter the elements being counted.

I'm not sure I have been clear enough, so do not hesitate to ask for further info.

Thanks in advance everybody :slight_smile:

Hi @AxelR,

I'm not aware of a way to do this at query time. You have to make sure your data is already indexed in a "de-duped" way.

One way to do this is to set up a transform job as described here: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/put-transform.html
A transform runs aggregations on your data (continuously, if it is configured with a sync setting) and stores the pre-aggregated results in a separate index, so you can build visualizations (like a histogram) on top of it.

Let's say you detect a duplicated event by its event_id field. If you group by event_id and add the min of your timestamp field to the result, the resulting index will contain only one document per event id, with the first occurrence as its timestamp:

PUT _transform/first_event_transform
{
  "source": {
    "index": "all_events"
  },
  "pivot": {
    "group_by": {
      "event": {
        "terms": {
          "field": "event_id"
        }
      }
    },
    "aggregations": {
      "first_occurrence": {
        "min": {
          "field": "timestamp"
        }
      }
    }
  },
  // ...
}

Based on the transform's destination index, you can create your histogram as usual.
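To make the grouping concrete, here is a minimal Python sketch of what the pivot computes and how a daily histogram would then count the results. It is only an in-memory illustration, not how Elasticsearch runs the transform. The sample documents and the helper names `first_occurrences` and `daily_histogram` are made up for this example; only the `event_id` and `timestamp` field names come from the transform above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample documents; field names follow the transform above.
events = [
    {"event_id": "a", "timestamp": "2020-03-01T10:00:00"},
    {"event_id": "a", "timestamp": "2020-03-02T09:00:00"},
    {"event_id": "b", "timestamp": "2020-03-02T11:00:00"},
    {"event_id": "b", "timestamp": "2020-03-01T23:30:00"},
    {"event_id": "c", "timestamp": "2020-03-02T08:00:00"},
]


def first_occurrences(docs):
    """Group by event_id and keep the minimum timestamp per group."""
    first = {}
    for doc in docs:
        key = doc["event_id"]
        ts = datetime.fromisoformat(doc["timestamp"])
        if key not in first or ts < first[key]:
            first[key] = ts
    return first


def daily_histogram(first):
    """Count first occurrences per calendar day."""
    return Counter(ts.date().isoformat() for ts in first.values())


first = first_occurrences(events)
histogram = daily_histogram(first)
print(dict(histogram))
```

Each event is counted once, on the day of its earliest timestamp, no matter how many later occurrences exist.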

Thanks, I think I understand the general idea!
Since the transform job creates a new index, I guess I also have to store the outcome (correct/incorrect) of the first test, don't I?

If you want to visualize it, yes. It's similar to a SQL query that groups by the event id: every field you want to access has to be defined as an aggregation, because there can be multiple documents within each group.

For fetching the outcome of the first event, you probably have to resort to a scripted metric: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-metrics-scripted-metric-aggregation.html
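As a rough illustration of the logic such a scripted metric would have to implement, the Python sketch below keeps, for each event id, the whole document with the earliest timestamp rather than just the minimum value. It is not Painless and not the scripted metric API. The `outcome` field name, the sample documents and the helper `first_documents` are assumptions made for this example.

```python
from datetime import datetime

# Hypothetical sample documents; "outcome" is an assumed field name.
events = [
    {"event_id": "a", "timestamp": "2020-03-02T09:00:00", "outcome": "correct"},
    {"event_id": "a", "timestamp": "2020-03-01T10:00:00", "outcome": "incorrect"},
    {"event_id": "b", "timestamp": "2020-03-01T23:30:00", "outcome": "correct"},
]


def first_documents(docs):
    """Return, per event_id, the whole document with the earliest timestamp."""
    first = {}
    for doc in docs:
        key = doc["event_id"]
        ts = datetime.fromisoformat(doc["timestamp"])
        if key not in first or ts < datetime.fromisoformat(first[key]["timestamp"]):
            first[key] = doc
    return first


first_by_event = first_documents(events)
first_outcomes = {key: doc["outcome"] for key, doc in first_by_event.items()}
print(first_outcomes)
```

The difference from a plain min aggregation is that the timestamp comparison decides which document's other fields are kept.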

If you are just after all fields of the first document, a solution using a Logstash pipeline is probably a better fit: https://www.elastic.co/blog/how-to-find-and-remove-duplicate-documents-in-elasticsearch

You need an additional service (Logstash) to process the data, but it's more straightforward for this kind of thing. Transforms are better suited if you just want to access aggregations of the groups.
