Dec 16th, 2019: [EN][Logstash] Indexing Github Events for Fun and Profit

Indexing Github Events for Fun and Profit

Idea

Like a lot of companies, we lean on Github for more than just git repo hosting. We use it to track work via tickets, asynchronously pair on code via PR reviews, and even use project boards for some Kanban goodness.

All well and good, but there are certain things Github just can't do. Like tell us whether the number of incoming critical bugs is trending up. Or give us hard numbers about who on the team is going above and beyond to review as many PRs as possible.

There are custom tools that can fetch data from Github via its API and assemble this kind of information for us. But we thought: What if we could just get this into Elasticsearch? Once there, could we build visualizations in Kibana to get the answers we need?

Architecture

Enter Logstash, and its Github input plugin.

Turns out Github will push everything that happens in your org -- a new PR opened, a new comment added to an issue, a new fork of a repo -- as separate events via a webhook. And if you point that webhook at a waiting Logstash instance with a properly configured Github input plugin, it'll catch that event and index it right into Elasticsearch.
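As a rough sketch of the receiving end, an input block for the Github plugin might look something like this. The port number and the GITHUB_WEBHOOK_SECRET environment variable are placeholders chosen for illustration, not values from our setup; secret_token and drop_invalid are the plugin's options for checking the webhook signature.

```
input {
  github {
    # Port the webhook listener binds to (assumed value; use whatever your
    # Github webhook URL points at)
    port => 8080

    # Shared secret configured on the Github webhook, read here from a
    # hypothetical environment variable
    secret_token => "${GITHUB_WEBHOOK_SECRET}"

    # Discard events whose signature does not match the secret
    drop_invalid => true
  }
}
```

With drop_invalid enabled, requests that aren't signed with the shared secret should be discarded rather than indexed.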

Our implementation has 3 Logstash nodes running on a Kubernetes cluster (easier to do with the new Helm chart), with traffic distributed in a round-robin fashion between them. Each node has a persistent queue backed by persistent storage, so we don't lose events even if one of the pods restarts.

After doing some filtering and munging of the incoming data (see below), each event then gets indexed via the ES output plugin into an ESS cluster.
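For the output side, a minimal sketch could look like the block below. The environment variable names and the index pattern are assumptions made up for this example; cloud_id and cloud_auth are the Elasticsearch output plugin's options for connecting to an ESS deployment.

```
output {
  elasticsearch {
    # Cloud ID of the ESS deployment (hypothetical environment variable)
    cloud_id => "${ESS_CLOUD_ID}"

    # Credentials in "user:password" form (hypothetical environment variable)
    cloud_auth => "${ESS_CLOUD_AUTH}"

    # Assumed naming scheme: one index per day
    index => "github-events-%{+YYYY.MM.dd}"
  }
}
```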

Challenges

As you can imagine, while simple in conception, the devil lay in the details.

It turns out Github sends a lot of data for each event. So much so that we easily hit Elasticsearch's default limit of 1,000 fields per index after the first day of indexing.

And not only does Github send way too many fields, but it loves to deeply nest them, which means the username you're interested in searching for ends up nested three or four levels deep.

So our Logstash config ended up growing quite a bit. We use a combination of the mutate and prune filters to first copy over the nested fields we care about into top-level ones, and then drop everything else.

Here's an excerpt from our logstash.conf file:

filter {
  # Copy the nested values we care about into top-level fields
  mutate {
    copy => {
      "[headers][x-github-event]" => "github-event"
      ...
      "[pull_request][author_association]" => "pull-request-author-type"
      "[pull_request][base][ref]" => "pull-request-base-ref"
      "[pull_request][base][repo][id]" => "pull-request-base-repo-id"
      "[pull_request][base][repo][name]" => "pull-request-base-repo-name"
      "[pull_request][base][repo][git_url]" => "pull-request-base-repo-url"
      "[pull_request][base][repo][owner][id]" => "pull-request-base-repo-owner-id"
      "[pull_request][base][repo][owner][login]" => "pull-request-base-repo-owner-name"
      ...
    }
  }

  # Drop every top-level field whose name matches none of these patterns
  prune {
    whitelist_names => ["_source", "github-event", "type", "action",
                        "timestamp", "@timestamp", "id", "name", "body", "updated",
                        "state", "title", "number", "created", "closed", "hash", "url",
                        "count", "ref", "forced", "compare", "before", "after", "deleted",
                        "commits", "merged", "sha", "submitted", "labels"]
  }
}

How is the whitelist so short? Because prune treats each entry as a regular expression and matches it anywhere in a field's name, so putting name in the whitelist is enough for both sender-name and pull-request-review-user-name to get through. Prune only looks at top-level fields, though, so a nested field like repository.name won't get through unless it has first been copied to a top-level field.
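To make the matching behavior concrete, here is a small sketch (not part of our config) contrasting an unanchored entry with an anchored one. The specific patterns are picked only for illustration.

```
filter {
  prune {
    whitelist_names => [
      # Unanchored: keeps any top-level field whose name contains "name",
      # e.g. "name", "sender-name", "pull-request-base-repo-name"
      "name",

      # Anchored: keeps only a top-level field called exactly "number"
      "^number$",

      # Keep the event timestamp
      "@timestamp"
    ]
  }
}
```

Anchoring with ^ and $ is worth considering if a short entry such as id or ref starts letting through more fields than intended.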

Results

Our setup has been running for about 6 months now, and it's already become a critical part of the workflow for several teams.

One group uses a watcher alert to send them a Slack message every time a new issue gets created with a breaking tag.

Another is using it to track product work that crosses multiple repos.

In Infra (my team), we're using it to track things like the balance of work among different parts of a team, and who's keeping up with their PR reviews.

Hope this inspires you to dig into your own repo data!