Ingest pipelines with Serial Difference aggregation

I'm storing quality metrics in Elasticsearch. One of those metrics is code coverage (extracted from Sonar). I want to calculate the difference in average code coverage for a project between weeks. For example, this week the average code coverage might be 80% and last week it was 90% for that same project, so the difference is -10%.

I've found that the serial differencing aggregation is perfect for calculating this, and I have a working query (syntax shared below). However, I need a UI to display these query results to users, and I don't plan to build one right now. The only way I've found to display serial difference query results in Kibana is with TSVB, but that'll only work if I do it as a histogram. I want a table view like this:

| Project name | % Coverage | % change from last week |
|--------------|------------|-------------------------|
| Project 1    | 80%        | -10%                    |
| Project 2    | 70%        | 5%                      |

A trend arrow in that % change column would be awesome. I couldn't get trend arrows to work in TSVB either, or at least they don't seem to do what I want.

So here's what I think I need as a solution, although it'd be nice to hear there's an easier approach.

I need to calculate the % change in coverage from the previous week at the moment the metric is indexed. Doing so would let me use Kibana's table visualization out of the box. If I create an ingest pipeline, I think I could then (rough sketch below the list):

  1. Extract the @timestamp field from the incoming metric.
  2. Subtract 7 days from it to get the date one week prior and use that to build the time range for the serial difference query.
  3. Extract the project name from the incoming metric so I'd know which project to query for in the serial difference query.
  4. Run the serial difference aggregation inside the ingest pipeline.
  5. Extract percent_change_from_last_week.value from the newest bucket, if a value exists.
  6. Append a new field called percent_change_from_last_week to the document being written.
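
Here's a rough sketch of how I picture steps 1-3 as a single script processor. To be clear, I haven't tested this; the pipeline name and the week_start/project fields are placeholder names of mine, and it's steps 4-6 (actually running the query) that I can't find a processor for:

    PUT /_ingest/pipeline/coverage-change-sketch
    {
      "description": "Sketch: derive the previous week's window from @timestamp",
      "processors": [
        {
          "script": {
            "lang": "painless",
            "source": """
              // week_start and project are placeholder field names
              ZonedDateTime ts = ZonedDateTime.parse(ctx['@timestamp']);
              ctx['week_start'] = ts.minusDays(7).toString();
              ctx['project'] = ctx['name'];
            """
          }
        }
      ]
    }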

Is there an example ingest pipeline out there that closely matches what I'm trying to do?

Here is my query:

    POST /sonarmetrics/_search
    {
      "size": 0,
      "aggs": {
        "2": {
          "terms": {
            "field": "key.keyword"
          },
          "aggs": {
            "my_date_histo": {                  
               "date_histogram": {
                  "field": "@timestamp",
                  "calendar_interval": "week"
               },
               "aggs": {
                  "avg_coverage": {
                     "avg": {
                        "field": "coverage"     
                     }
                  },
                  "percent_change_from_last_week": {
                     "serial_diff": {                
                        "buckets_path": "avg_coverage",
                        "lag" : 1
                     }
                  }
               }
            }
          }
        }
      },
      "query": {
        "bool": {
          "must": [],
          "filter": [
            {
              "match_all": {}
            },
            {
              "match_phrase": {
                "name.keyword": "Project1" // Value needs to come from name field in metric
              }
            },
            {
              "range": {
                "@timestamp": {
                  "gte": "2020-07-08T19:29:12.054Z", // Needs value from @timestamp minus 1 week
                  "lte": "2020-07-15T19:29:12.055Z", // Should be the value from @timestamp
                  "format": "strict_date_optional_time"
                }
              }
            }
          ],
          "should": [],
          "must_not": []
        }
      }
    }

Here is an example result:

    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 153,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "2" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "Project1",
              "doc_count" : 153,
              "my_date_histo" : {
                "buckets" : [
                  {
                    "key_as_string" : "2020-07-06T00:00:00.000Z",
                    "key" : 1593993600000,
                    "doc_count" : 98,
                    "avg_coverage" : {
                      "value" : 76.21224562975826
                    }
                  },
                  {
                    "key_as_string" : "2020-07-13T00:00:00.000Z",
                    "key" : 1594598400000,
                    "doc_count" : 55,
                    "avg_coverage" : {
                      "value" : 75.94545371315697
                    },
                    "Percent_Change" : {
                      "value" : -0.2667919166012922 // This is the value I want to extract into a field called percent_change_from_last_week
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }

One minor update: I now realize I could eliminate the need to extract @timestamp by just using date math. That's close enough:

"gte": "now-1w", // Needs value from @timestamp minus 1 week
"lte": "now",

This boils down to being able to query Elasticsearch from within an ingest pipeline, and the only processor that can do that today is the enrich processor, which is very limited at the moment.
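
To illustrate the limitation: enrich can only do exact-match lookups against a pre-built enrich index; it can't run an aggregation at ingest time. At best you'd pre-compute each project's weekly average into a lookup index yourself and re-execute the policy on a schedule. A minimal sketch, assuming a hypothetical weekly_coverage lookup index keyed by project name:

    PUT /_enrich/policy/weekly-coverage-sketch
    {
      "match": {
        "indices": "weekly_coverage",
        "match_field": "name",
        "enrich_fields": ["avg_coverage"]
      }
    }

    POST /_enrich/policy/weekly-coverage-sketch/_execute

    PUT /_ingest/pipeline/coverage-enrich-sketch
    {
      "processors": [
        {
          "enrich": {
            "policy_name": "weekly-coverage-sketch",
            "field": "name",
            "target_field": "last_week"
          }
        }
      ]
    }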

You won't be able to do what you want in an ingest pipeline. Doing it in Logstash using the elasticsearch filter may work, but the best place to do it is in your application at the point of indexing.
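
Concretely, your indexer would run the serial_diff query you already have (scoped to the metric's project and the last two weeks), read percent_change_from_last_week.value from the newest bucket, and attach it to the document before indexing. The end result would look something like this (the values here are illustrative):

    POST /sonarmetrics/_doc
    {
      "name": "Project1",
      "key": "Project1",
      "coverage": 75.9,
      "@timestamp": "2020-07-15T19:29:12.055Z",
      "percent_change_from_last_week": -0.27
    }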

Thanks George. It wasn't what I was hoping for, but it keeps me from barking further up the wrong tree. It's too bad the trend arrows in the TSVB table view don't (or can't?) do what I want here; it almost seems like that's what they were intended for. I had come across something indicating they were broken in some cases until a few days ago, but I'm not sure I can get the fix on the Elastic Cloud hosted solution right now either.

Anyone?
