Date Histogram doesn't work with calendar_interval month

I have tried to use a group_by with a date_histogram by month. Our example works as expected with "calendar_interval": "day" but returns nothing with "calendar_interval": "month". In fact, it seems to work with day, week, and hour, but not with month.

Sample Data

[
  {
    "productID": 1001,
    "accountID": "1",
    "timeStamp": "2023-02-28T07:00:15.043331859Z"
  },
  {
    "productID": 1002,
    "accountID": "1",
    "timeStamp": "2023-02-28T07:00:15.043331859Z"
  },
  {
    "productID": 1001,
    "accountID": "2",
    "timeStamp": "2023-02-28T07:00:15.043331859Z"
  }
] 

Transform

Group by month and accountID.

{
  "source": {
    "index": "product-tracking-events*"
  },
  "pivot": {
    "group_by": {
      "time": {
        "date_histogram": {
          "field": "timeStamp",
          "calendar_interval": "month"
        }
      },
      "account": {
        "terms": {
          "field": "accountID.keyword"
        }
      }
    },
    "aggregations": {
      "productIDs": {
        "cardinality": {
          "field": "productID",
          "precision_threshold": 1000
        }
      }
    }
  },
  "frequency": "1m",
  "sync": {
    "time": {
      "delay": "60s",
      "field": "timeStamp"
    }
  },
  "description": "Saves number of unique productIDs for an accountID",
  "dest": {
    "index": "transform-product-events"
  }
}

Expected Result

We want to see the number of unique products per account. So account 1 had 2 products and account 2 had 1 product:

[
  {
    "date": "2023-02-01T00:00:00.0Z",
    "account": "1",
    "productIDs": 2
  },
  {
    "date": "2023-02-01T00:00:00.0Z",
    "account": "2",
    "productIDs": 1
  }
]

Your steps look OK to me, and I could not reproduce the issue.

Could you post how accountID is mapped? You specify accountID.keyword, so I suspect your problem has something to do with multi-fields.
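
For reference, you can fetch the current mapping with:

GET product-tracking-events*/_mapping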

For debugging you could check what happens if you use ordinary aggs and/or a composite aggregation:

GET product-tracking-events*/_search
{
  "size": 0,
  "aggs": {
    "c": {
      "composite": {
        "sources": [
          {
            "time": {
              "date_histogram": {
                "field": "timeStamp",
                "calendar_interval": "month"
              }
            }
          },
          {
            "account": {
              "terms": {
                "field": "accountID.keyword"
              }
            }
          }
        ]
      }
    }
  }
}

Does the problem persist regarding day vs. month?

In both the transform preview and the composite aggregation, you can try what happens if you set missing_bucket to true, like this:

            "account": {
              "terms": {
                "field": "accountID.keyword",
                "missing_bucket": true
              }
            }
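
For the transform side, you can run the same pivot through the preview endpoint; a sketch using the config from above, with missing_bucket added:

POST _transform/_preview
{
  "source": {
    "index": "product-tracking-events*"
  },
  "pivot": {
    "group_by": {
      "time": {
        "date_histogram": {
          "field": "timeStamp",
          "calendar_interval": "month"
        }
      },
      "account": {
        "terms": {
          "field": "accountID.keyword",
          "missing_bucket": true
        }
      }
    },
    "aggregations": {
      "productIDs": {
        "cardinality": {
          "field": "productID",
          "precision_threshold": 1000
        }
      }
    }
  }
}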

@Tobse How far back does your data go?

For testing we used only data from the same day. The timestamps ranged from 14:00 to 17:00.

mapping

The mapping is:

{
  "mappings": {
    "_doc": {
      "properties": {
        "accountID": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "productID": {
          "type": "long"
        },
        "timeStamp": {
          "type": "date"
        }
      }
    }
  }
}

missing bucket

The field accountID.keyword is mandatory. We tried it with "missing_bucket": true, which made no difference.

composite aggs

When we use a composite aggregation with months or days, it works as expected. We get the grouped data.

Thanks @Tobse

What is your version of the stack, so I can try to reproduce it on the same version?

In the example data your dates were all in February. Does that match the data you are testing with?

By default, a date_histogram in a transform only creates a bucket once the bucket is complete. In other words, a bucket for February gets created, but a bucket for March will not be created today - as we have March 30th - it will be created on April 1st at the earliest. The same applies to days: a bucket for today would not be created today, because the day isn't over.

If you want interim buckets, so filling buckets although they are not complete yet, you can set the parameter align_checkpoints to false, like this:

"settings": {
  "align_checkpoints": false
}

This setting causes frequent updates to the documents in the transform destination index (of course only if an update is necessary), which makes it more expensive. But you can see the monthly results while the month isn't over yet.
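
If the transform already exists, the setting can also be applied via the transform update API; a sketch, where your-transform-id is a placeholder, and the change should take effect at the next checkpoint:

POST _transform/your-transform-id/_update
{
  "settings": {
    "align_checkpoints": false
  }
}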

We are using ES version 7.17.7 at the moment.

@Hendrik_Muhs Thanks for the hint. Yes, you were right: we missed the settings.align_checkpoints flag. We are trying to transform the data while it is generated, because we are not able to store a whole month of data.

Our goal is to aggregate the unique-ID count for the current month and then automatically roll over to the new month once it begins, while still being able to see the data of the current month.

We included "settings.align_checkpoints": false in our current config, but after letting it run for some time, we have the problem that the buckets seem to be created correctly, yet the productIDs aggregation is always 0:

"_source": {
    "productIDs": 0,
    "time": "2023-03-01T00:00:00.000Z",
    "account": "a1"
}

Any idea what the problem here could be?

That's a problem. When a transform re-creates a bucket, it recalculates it, so all data necessary for that bucket must still be retrievable. In other words, at least the last month must be there, and you can only delete a month once it's fully over.

Your use of cardinality illustrates the problem. While min, max, or sum could be calculated without historic data, cardinality requires an approximate data structure called HyperLogLog++. Think of it as a large binary array. We don't store this binary array, only the final output. That means every time the bucket has to be re-calculated, this array gets re-created in order to calculate the cardinality from it. Technically speaking, a min could be updated by taking the min of the current and the new value, but because we support many aggregations, including cardinality and scripted_metric, the continuous mode works by always re-calculating the bucket.

If you don't have the resources to store a full month of data, the only option I see is to decrease the time window of the date_histogram; maybe you can store at least a week.

For cardinality, however, that's a problem again. If you have 4 documents for 4 weeks, each with a cardinality value for its week, it is impossible to calculate the cardinality of the month, because you can't tell from 4 numbers how many of the underlying values overlap. For example, 4 weekly cardinalities of 2 could mean anywhere between 2 and 8 unique values for the month.

Possible solution

What I can think of is creating stacked transforms: 2 transforms chained together. Transform 1 summarizes data daily with a group_by on timestamp, accountID, and productID. As the aggregation, you could count the number of docs that match that combination:

 "count" : { "value_count" : { "field" : "productID" } }

My hope is that this reduces the source data significantly, which only works if you have tons of documents per day, account, and product. Note that this transform does not use a cardinality agg.
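
A minimal sketch of such a first transform; the transform id and destination index name are placeholders, and the daily interval follows the description above:

PUT _transform/daily-product-counts
{
  "source": {
    "index": "product-tracking-events*"
  },
  "pivot": {
    "group_by": {
      "time": {
        "date_histogram": {
          "field": "timeStamp",
          "calendar_interval": "day"
        }
      },
      "account": {
        "terms": {
          "field": "accountID.keyword"
        }
      },
      "product": {
        "terms": {
          "field": "productID"
        }
      }
    },
    "aggregations": {
      "count": {
        "value_count": {
          "field": "productID"
        }
      }
    }
  },
  "dest": {
    "index": "daily-product-counts"
  }
}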

Given that transform, you can delete the source data for a day once that day is over.

Transform 2 could run on the destination of transform 1 and do your monthly summary. You need to keep the data from the 1st transform for a month, but as it has been reduced, that's hopefully within your storage budget.
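
The second transform might then look like this; a sketch that assumes the field names time, account, and product written by the first transform above:

PUT _transform/monthly-product-summary
{
  "source": {
    "index": "daily-product-counts"
  },
  "pivot": {
    "group_by": {
      "time": {
        "date_histogram": {
          "field": "time",
          "calendar_interval": "month"
        }
      },
      "account": {
        "terms": {
          "field": "account"
        }
      }
    },
    "aggregations": {
      "productIDs": {
        "cardinality": {
          "field": "product",
          "precision_threshold": 1000
        }
      }
    }
  },
  "dest": {
    "index": "monthly-product-summary"
  }
}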

The 1st transform could use a retention_policy to delete data from its destination index after a month.
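
In the first transform's config that could look like this; a sketch, where the max_age of 35d is an assumption that leaves some margin beyond a calendar month:

"retention_policy": {
  "time": {
    "field": "time",
    "max_age": "35d"
  }
}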

(FWIW: to save storage, disable _source for both transform destination indices.)
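
Disabling _source means creating the destination index before starting the transform, for example like this (a sketch; keep in mind that without _source the documents can no longer be reindexed):

PUT daily-product-counts
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}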

Thanks a lot for your detailed explanation :+1: This helped a lot!
I thought that in continuous mode the magical HyperLogLog structure would be stored, or even held in memory, so it would be possible to index previous data and even update it with live data. Having this option would be a real killer feature.

Our simple use case: count the number of different products a user accesses per month.
Sample answer: user Alex loaded 4,921,465 unique products in May. The total number of loaded products in May was 306,357,360 (real numbers).

Also having a live preview for the month would be great, but fortunately in our case it's enough to calculate it afterwards. I wanted to skip scanning 600 GB of historical data per month, because accessing this "cold" data takes a lot of time and produces a heavy load (I know I can reduce the load with limits).

In conclusion, I need to disable the continuous mode for the transform and optimize and fine-tune the aggregation of historical data.
