Rollup - Sum Bucket Aggregation results are buggy

This query works fine on a regular index, but when I run it against a rolled-up index the resulting values are about 66% of what I'd expect. The rolled-up index is also missing the sum_bucket (not to be confused with the Chum Bucket ;).

I ran some tests for these particular days, checking whether the sum of the bytes field matches between the raw index and the rolled-up index.

Query:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "2018-03-27", "lte": "2018-03-28" } } },
        {"term": { "account": "27937601" } }
      ]
    }
  },
  "aggs": {
    "bytes_by_interval": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "24h"
      },
      "aggs": { "bytes_per_period": { "sum": { "field": "bytes" } } }
    },
    "sum_bytes": { "sum_bucket": { "buckets_path": "bytes_by_interval>bytes_per_period" } }
  }
}

Result of running it against the regular index:

{
  "took": 28,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1044898,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "bytes_by_interval": {
      "buckets": [
        {
          "key_as_string": "2018-03-27T00:00:00.000Z",
          "key": 1522108800000,
          "doc_count": 604052,
          "bytes_per_period": {
            "value": 397239457230
          }
        },
        {
          "key_as_string": "2018-03-28T00:00:00.000Z",
          "key": 1522195200000,
          "doc_count": 440846,
          "bytes_per_period": {
            "value": 284955659441
          }
        }
      ]
    },
    "sum_bytes": {
      "value": 682195116671
    }
  }
}

Result of running it against the rolled-up index:

{
  "took": 1,
  "timed_out": false,
  "terminated_early": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "bytes_by_interval": {
      "meta": {},
      "buckets": [
        {
          "key_as_string": "2018-03-27T00:00:00.000Z",
          "key": 1522108800000,
          "doc_count": 434303,
          "bytes_per_period": {
            "value": 262840318259
          }
        },
        {
          "key_as_string": "2018-03-28T00:00:00.000Z",
          "key": 1522195200000,
          "doc_count": 315924,
          "bytes_per_period": {
            "value": 190073569431
          }
        }
      ]
    }
  }
}

Let me know if there's anything else you need.

Hmm, can you post your Rollup job config too? I'd like to check the interval and delay (if configured). Do you have any other jobs configured for the same destination rollup index, or is there just one job?
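
If it's easier, something like this should dump the config for every configured job (the _all wildcard is from memory for the 6.x API, so double-check it):

GET _xpack/rollup/job/_all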

Is the query (via the RollupSearch API) going to only the rollup index, or both rollup + raw data?
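
For reference, I mean the difference between something like these two (the index names here are just placeholders for yours; as far as I recall the rollup search endpoint accepts a comma-separated list that can mix the rollup index with raw indices):

GET my_rollup_index/_rollup_search
{
  "size": 0,
  "aggs": { "total_bytes": { "sum": { "field": "bytes" } } }
}

GET my_rollup_index,my_raw_index/_rollup_search
{
  "size": 0,
  "aggs": { "total_bytes": { "sum": { "field": "bytes" } } }
}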

I think the missing sum_bucket is just a bug where we haven't implemented it yet but don't throw an exception. The incorrect sum values are more concerning, though.

Oh, would you mind checking whether the results are correct or incorrect when you remove one or both of the query filters (the range and the term)? I'm wondering if it might be related to how we're filtering the rollup data.
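
For example, dropping the term filter and keeping only the range would look something like this:

{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "2018-03-27", "lte": "2018-03-28" } }
  },
  "aggs": {
    "bytes_by_interval": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "24h"
      },
      "aggs": { "bytes_per_period": { "sum": { "field": "bytes" } } }
    },
    "sum_bytes": { "sum_bucket": { "buckets_path": "bytes_by_interval>bytes_per_period" } }
  }
}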

Thanks!

Below is the rollup job config. This is the only job configured for the ms_24h rollup index. Yes, the query is hitting just the one index (not both raw & rollup).

Given that the sum_bucket is missing on the rollup, it's not practical to compare the raw vs. rolled-up results without the range query filter, because this index spans roughly a year's worth of data. But when I remove the term query filter I still see the same problem.

I did this with a 24 hour rollup; I'll test with a 5 minute rollup now and see if I have the same problem.

PUT _xpack/rollup/job/ms_24h
{
    "index_pattern": "mediaserver_20*",
    "rollup_index": "ms_24h",
    "cron": "*/10 * * * * ?",
    "page_size" :10000,
    "groups" : {
      "date_histogram": {
        "field": "timestamp",
        "interval": "24h",
        "delay": "10m"
      },
      "terms": {
        "fields": ["account", "stream", "streamtype", "host", "http_host", "clientip_n_agent", "geoip.region_name", "geoip.country_name", "geoip.continent_code", "cache_status", "response", "verb"]
      },
      "histogram": {
        "fields": ["request_time"],
        "interval": 1
      }
    },
    "metrics": [
        {
            "field": "bytes",
            "metrics": ["sum"]
        },
        {
            "field": "request_time",
            "metrics": ["avg","min","max"]
        }
    ]
}

I've done some Visualizations on this 24 hour Rollup index and they all return understated results, approximately 25% lower over longer periods of time. I'm running a Rollup job to create a 5 minute rollup and will run these same tests against it. Could it be an issue with the 24 hour interval?

Hm. I don't think it's the 24h interval... I wanted to see if you were using something like 1d in the config (calendar time) and 24h in the query (fixed time). The mismatch between calendar and fixed intervals could show up as a discrepancy like this, and we're adding a check for it so users aren't bitten by the difference. Doesn't appear to be the case here, though.
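
For illustration, the mismatch I had in mind would look roughly like this (calendar time in the job, fixed time in the query), which isn't what you have:

Job config (calendar interval):
  "date_histogram": { "field": "timestamp", "interval": "1d", "delay": "10m" }

Query (fixed interval):
  "date_histogram": { "field": "@timestamp", "interval": "24h" }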

Good to know about the term filter, that helps eliminate that as a potential problem.

I'm assuming this testing is on 6.3? I'm wondering if you're running into the ID collision issue we just fixed. Basically, we did a bad job with the document IDs in 6.3 and they weren't wide enough, which could lead to collisions in large enough indices. How big is your rollup index? Collisions would manifest as incorrect, under-reported values like this.

Would it be possible for you to retry the tests on 6.4? We fixed Rollup to use 128-bit IDs. Alternatively, you could rerun the job with an interval larger than 24h (or a job with fewer terms)... that will generate fewer docs, which leads to fewer collisions. It might give us some idea whether that's the problem or not.
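
For example, a trimmed-down test job might look roughly like this (the job/index names, the larger interval, and the reduced term list are just for illustration):

PUT _xpack/rollup/job/ms_collision_test
{
    "index_pattern": "mediaserver_20*",
    "rollup_index": "ms_collision_test",
    "cron": "*/10 * * * * ?",
    "page_size": 10000,
    "groups": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "48h",
        "delay": "10m"
      },
      "terms": {
        "fields": ["account"]
      }
    },
    "metrics": [
        {
            "field": "bytes",
            "metrics": ["sum"]
        }
    ]
}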

Did you notice any errors/warnings in the log that the job may have generated?

Thanks for helping to diagnose this! Really appreciate the detailed responses so far :slight_smile:

Yes, these tests were on 6.3. Yesterday we upgraded our cluster to 6.4 and I tested again, with the same result: the values are still about 66% of what I'd expect.

This index has 2.3 million docs.

I have to assume that the old index is still using 32-bit IDs, so I'm creating a new 1hr rollup index and will test again; I'll let you know what I find. Btw, no errors/warnings show up in the /var/log/elasticsearch logs.

I could give you ssh access to our cluster if you want to have a look. The cluster has both production indexes and these rollups that I'm testing.

Yeah that's correct. The old IDs remain, so any collisions that occurred will be permanent in the index. If this is the root of the issue we'll only see a change after you completely rebuild the job.

Let's see what the new job says, and then go from there. IIRC, at least one collision with the 32-bit IDs is virtually guaranteed by around 200k docs, so at 2.3m you probably have more than a few.
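
For rough intuition, the birthday-problem estimate (assuming the IDs behave like uniformly random 32-bit values) is

P(\text{at least one collision}) \approx 1 - e^{-n^{2} / (2 \cdot 2^{32})}

so for n = 200{,}000 docs the exponent is about 4.7 and P \approx 0.99, while at n = 2{,}300{,}000 the expected number of colliding pairs, n^{2} / (2 \cdot 2^{32}), works out to roughly 600.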

I have some more ideas for debugging but they are going to get increasingly fine-grained / invasive :slight_smile:

SSH might be a bit dangerous, we try not to touch production clusters to limit accidents :slight_smile: If you're comfortable sharing data, a snapshot of the index would be the best option. My email is zach@elastic.co, if you want to hash out details in private over email.

OK, now that I've upgraded to 6.4 (and rebuilt my rollup index) I get the correct rollup results!! I needed the 128-bit IDs. Yay!

Thanks for your help Zachary!

What about the sum_bucket missing? Any idea if that functionality will be implemented any time soon?


Glad to hear that was the issue! The alternative was some kind of insidious bug, which would have been tricky to find.

Sorry for the bug in the first place, though! Definitely not one of my better moments letting that code ship. :confused:

Not sure, I'll look into that today. Pipeline aggs should be compatible with rollup; I think we're just missing adding them back to the query after we do our internal rewriting shenanigans.

If it's a quick fix I'll open a PR, otherwise I'll open a ticket you can track. More info soon.

Quick update here: I think adding support for pipelines is doable, although it's a little trickier than I was hoping. Conceptually they run after regular aggregations, so it should be easy in theory... but in practice they are embedded in the agg tree and the reduction order is a little complicated.

Anyhow, I'm working on a PR and think I can get this working sooner rather than later.

Thanks Zachary! This is lower priority for us (more of a nice to have) but getting this functionality would definitely be appreciated.
