Hi Team,
We have configured Elasticsearch along with Logstash.
Our setup is:
There is a data stream whose documents have the following structure (@timestamp is generated by Logstash at ingest time):
{
  "id": "1",
  "activityTime": "2023-03-09T02:11:57.000Z",
  "activity": "activity_1", // these values are unique
  "@timestamp": "2023-04-19T20:11:00.547341200Z"
}
We have a monthly transform:
PUT _transform/irm-activity-index-monthly-transform
{
  "description": "IRM transform for monthly data",
  "source": {
    "index": [
      "sourve_name"
    ]
  },
  "dest": {
    "index": "dest_name"
  },
  "frequency": "60s",
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "activityTime": {
        "date_histogram": {
          "field": "activityTime",
          "calendar_interval": "1M"
        }
      },
      "activity": {
        "terms": {
          "field": "activity"
        }
      }
    },
    "aggregations": {
      "totalActivitiesCount": {
        "value_count": {
          "field": "activity"
        }
      }
    }
  },
  "retention_policy": {
    "time": {
      "field": "activityTime",
      "max_age": "248d"
    }
  }
}
Points to note:
The sync is on the @timestamp field, while the monthly date histogram is configured on activityTime, as shown above.
@timestamp is generated by Logstash. Its fractional-second precision varies: in some cases it has 9 digits, in others 6.
activityTime is not sequential. Backdated activities can also arrive.
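To make the precision point concrete, here is a minimal Python sketch (not part of the pipeline) that truncates both forms of @timestamp to millisecond precision. It assumes @timestamp is mapped as a plain date field, which stores millisecond resolution, rather than date_nanos; the mapping type is not shown above, so this is only an assumption. The function name truncate_to_millis and the 6-digit value are hypothetical, for illustration only.

```python
import re


def truncate_to_millis(ts: str) -> str:
    """Keep at most three fractional-second digits of a UTC timestamp ending in 'Z'."""
    match = re.fullmatch(r"(.+?)(?:\.(\d+))?Z", ts)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {ts}")
    head = match.group(1)
    fraction = match.group(2) or ""
    # Pad short fractions with zeros, then cut at millisecond precision.
    millis = (fraction + "000")[:3]
    return f"{head}.{millis}Z"


nine_digits = "2023-04-19T20:11:00.547341200Z"  # value from the sample document
six_digits = "2023-04-19T20:11:00.547341Z"      # hypothetical 6-digit variant

print(truncate_to_millis(nine_digits))
print(truncate_to_millis(six_digits))
```

Under that assumption, both values would resolve to the same millisecond, so the differing digit counts alone would not change which documents fall into a sync window.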
Issue:
Let's say we have activities with the following activityTime values:
"activityTime" : "2023-01-01T04:33:15.000Z"
"activityTime" : "2023-01-02T04:33:15.000Z"
"activityTime" : "2023-01-29T04:33:15.000Z"
"activityTime" : "2023-01-30T04:33:15.000Z"
These get aggregated correctly.
But if the following activity arrives in the next cycle:
"activityTime" : "2023-01-15T04:33:15.000Z"
then the transform result covers only 3 entries, i.e. those for 1 Jan, 2 Jan and 15 Jan.
It does not consider the 29 Jan and 30 Jan activities when recomputing, which leads to an incorrect aggregation result.
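For reference, the result we expect can be sketched in plain Python, independent of Elasticsearch. This is only an illustration of the intended grouping (calendar month of activityTime plus activity, with a count per group), not of how the transform computes it. It assumes, for simplicity, that all five documents carry the same activity value "activity_1"; the helper name monthly_counts is hypothetical.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(docs):
    """Count documents per (calendar month of activityTime, activity) pair."""
    counts = Counter()
    for doc in docs:
        ts = datetime.strptime(doc["activityTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
        counts[(ts.strftime("%Y-%m"), doc["activity"])] += 1
    return counts


# The four activities from the first cycle plus the backdated one from the next cycle.
activity_times = [
    "2023-01-01T04:33:15.000Z",
    "2023-01-02T04:33:15.000Z",
    "2023-01-29T04:33:15.000Z",
    "2023-01-30T04:33:15.000Z",
    "2023-01-15T04:33:15.000Z",
]
# Assumption: every document has the same activity value.
docs = [{"activityTime": t, "activity": "activity_1"} for t in activity_times]

for (month, activity), count in monthly_counts(docs).items():
    print(month, activity, count)
```

With these inputs the January bucket should contain all five activities, whereas the transform output reflects only three of them.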