Histogram (bars) with counts as bins


#1

Hi,
we are logging meetings and their participants; say we log pairs (A, B), where A is the meeting ID and B is the participant ID. Abstractly, this gives us a mapping c: Meeting -> Number of participants, A -> #{B such that (A, B) is an event}. From that we want to draw a histogram whose bins are the numbers of participants, so that the histogram tells you: for this number of participants, we have that many meetings, i.e. the mapping n -> #{A such that c(A) = n} (where c is the auxiliary mapping).
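In code terms, the two counting steps look like this (a quick Python sketch with toy data; the IDs are placeholders):

```python
from collections import Counter

# Logged events: (meeting ID, participant ID) pairs
events = [
    ("m1", "p1"), ("m1", "p2"), ("m1", "p3"),
    ("m2", "p1"), ("m2", "p2"),
    ("m3", "p4"), ("m3", "p5"),
]

# Step 1: c(A) = number of distinct participants per meeting
participants = {}
for meeting, participant in events:
    participants.setdefault(meeting, set()).add(participant)
c = {meeting: len(ps) for meeting, ps in participants.items()}

# Step 2: the histogram n -> number of meetings with exactly n participants
histogram = Counter(c.values())
# here: 2 meetings with 2 participants, 1 meeting with 3
```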

Is that possible in Kibana?

Thanks


(Tim Sullivan) #2

Abstractly, this gives us a mapping

When we talk about "mapping" in this forum, it generally means the type definitions of the fields in the actual documents in Elasticsearch. I'm not clear on whether you have the participant ID in your mapping, or the number of participants.

Sounds to me like you are looking for trends in the number of participants in meetings.

{
  "@date": <some timestamp>,
  "meetingId": "meeting-12345",
  "participantId": "participant-98765",
  # maybe some other fields
}

or

{
  "@date": <some timestamp>,
  "meetingId": "meeting-12345",
  "numParticipants": 20,
  # maybe some other fields
}

If you have a document for each meetingID / participantID combination:

We can get each meeting and the number of unique participants in it, and visually compare to spot common participant counts across meetings. Beyond that, I don't see a way to count the meetings for each number of participants and aggregate on that metric.

NOTE: I spent a bit of time playing with various aggregations in Elasticsearch to see what a raw query is capable of. If you look further into the documentation on the pipeline aggregation options, you might be able to drill in deeper than I did. This query can provide some stats that you might find interesting:

{
  "aggs": {
    "meetings": {
      "terms": {
        "field": "meetingId",
        "size": 100
      },
      "aggs": {
        "num_participants": {
          "cardinality": {
            "field": "participantId"
          }
        }
      }
    },
    "participant_stats": {
      "stats_bucket": {
        "buckets_path": "meetings>num_participants"
      }
    }
  }
}

That will give you the number of unique participants per meeting, plus min/max/average participant counts across meetings. Not a histogram, but some high-level metrics. Example result:

"aggregations": {
  "meetings": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "meeting-iuh5js77h3",
        "doc_count": 29,
        "num_participants": {
          "value": 12
        }
      },
      {
        "key": "meeting-s6k67hl5sj",
        "doc_count": 28,
        "num_participants": {
          "value": 10
        }
      },
      {
        "key": "meeting-45i6jhhls1",
        "doc_count": 14,
        "num_participants": {
          "value": 9
        }
      },
      {
        "key": "meeting-sh58f6gihj",
        "doc_count": 13,
        "num_participants": {
          "value": 8
        }
      },
      {
        "key": "meeting-4dglkd98ss",
        "doc_count": 10,
        "num_participants": {
          "value": 8
        }
      },
      {
        "key": "meeting-g4sdfjldsj",
        "doc_count": 6,
        "num_participants": {
          "value": 4
        }
      }
    ]
  },
  "participant_stats": {
    "count": 6,
    "min": 4,
    "max": 12,
    "avg": 8.5,
    "sum": 51
  }
}

If you have numParticipants in your mapping:

You can make a bar chart where the x-axis is a spread of the numParticipants field. To have the x-axis numbers appear in numerical order, order the x-axis by a single-value metric of numParticipants, which I set to a Max aggregation.

I had 100 meetings in my test data. If I add up the number of meetings held for each participant count (i.e. sum all my Y values), I get 100.
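For reference, at the raw-query level this bar chart corresponds to a plain histogram aggregation on numParticipants (interval 1 gives one bucket per participant count, and each bucket's doc_count is the number of meetings); a sketch:

```json
{
  "size": 0,
  "aggs": {
    "meetings_per_participant_count": {
      "histogram": {
        "field": "numParticipants",
        "interval": 1
      }
    }
  }
}
```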

If you don't have numParticipants as a field in your data, you always have the option of pre-processing that calculation at index time.


(michael) #3

How can I create a histogram myself? Which tool should I use?


(Tim Sullivan) #4

From what it initially sounded like you want, you're looking for a chart like the second example I posted. For simplicity on the Kibana side, it'd be best to have each document include a numParticipants field.

Can you share an example of what a document of yours looks like?


#5

Hi Tim,

thanks a lot for your detailed answer. Our documents in fact look like your first example:

{
  "@date": <some timestamp>,
  "meetingId": "meeting-12345",
  "participantId": "participant-98765",
  # maybe some other fields
}

The main problem seems to be how to derive a set of auxiliary documents looking like your second example:

{
  "@date": <some timestamp>,
  "meetingId": "meeting-12345",
  "numParticipants": 20,
  # maybe some other fields
}

This set corresponds to what I was referring to as "auxiliary mapping" (mapping in the sense of function). If we had these auxiliary documents explicitly, then we would be done, since then we could "bucket" on numParticipants and count the documents (as you did in your final histogram).

So, the questions are:

  • You gave a search query that can be typed into the Kibana console and indeed returns "buckets" counting the number of participants per meeting. Can this query be used to back a Kibana visualization?

  • You mentioned "pipeline aggregations" in Elasticsearch. Is this something that could also be done in a Kibana visualization (after all, we want to draw bars, not run console queries)?

  • You said "If you don't have numParticipants as a field in your data, you always have the option of pre-processing that calculation at index time". Indeed, we do not have that field, but I have no idea how this could be calculated "at index time" (do you mean in Logstash?). Would we need the Elasticsearch filter plugin for Logstash (which we do not have available)?

Thanks for your support.


(Tim Sullivan) #6

I looked around for quite a bit, and I couldn't find an aggregation type in Elasticsearch that can count the unique values of an aggregation's output. I'll ask around some more and see if I missed something in the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/search-aggregations.html

The question about pipeline aggregations comes back to the same point: I'm not sure there is a pipeline type that can take the output of one aggregation and count the distinct values in it.

I think you could use the Elasticsearch input plugin (not filter plugin) in Logstash: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-elasticsearch.html. The input query would be the aggregation, and you could re-index the results into a new index, then use the new index in Kibana for the visualization.
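An untested sketch of that Logstash pipeline (hosts and index names are placeholders; note that the input plugin's default output is one event per hit, so getting at the aggregation buckets rather than the raw hits may need extra handling depending on your plugin version):

```
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "meetings-raw"
    query => '{ "size": 0, "aggs": { "meetings": { "terms": { "field": "meetingId", "size": 10000 }, "aggs": { "num_participants": { "cardinality": { "field": "participantId" } } } } } }'
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "meetings-counts"
  }
}
```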

If you don't have that plugin for Logstash and are not able to get it, you can achieve the same with some custom scripts in a language like Python or NodeJS to run a query, format the results into new documents, and index those into Elasticsearch.
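If you go the custom-script route, the core transformation is small; here is a hedged Python sketch of just that step (fetching the raw events and bulk-indexing the new documents, e.g. via the official Elasticsearch client, are left out; field names follow the examples above):

```python
def count_participants(events):
    """Turn (meetingId, participantId) events into one document per
    meeting carrying numParticipants, ready for re-indexing."""
    seen = {}
    for meeting_id, participant_id in events:
        # A set per meeting deduplicates repeated participant events
        seen.setdefault(meeting_id, set()).add(participant_id)
    return [
        {"meetingId": m, "numParticipants": len(ps)}
        for m, ps in sorted(seen.items())
    ]

docs = count_participants([
    ("meeting-1", "p-1"), ("meeting-1", "p-2"),
    ("meeting-2", "p-1"),
])
# one document per meeting, with its distinct participant count
```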


(Tim Sullivan) #7

I checked with some developers of Elasticsearch and they confirm that for now, there isn't a pipeline aggregation type that will do what we're looking for here.


#8

Hi Tim,

thanks a lot for digging into details. Since we are using a managed service for Elastic, I will ask our provider what tools are available at index time.

Best


(Tim Sullivan) #9

I hope that works out for you!

Note that I haven't tried the solution of using Elasticsearch input plugin myself. I'll try it out hopefully soon when I have some time. I did some research though, and it looks like someone was able to get it to work and index their aggregation buckets as documents in a new index: Aggregation Query possible input ES plugin


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.