[HELP] Transform not accounting for documents with the same timestamp

Hi, I am new to Elasticsearch. I have a data stream that uses a custom timestamp field produced by Envoy, fed through Filebeat to Logstash (with grok), and then to Elasticsearch. The use case is that thousands of requests come into a log, and many of these documents/requests share the same timestamp down to the millisecond.

I am trying to make a 1-minute continuous aggregation/transform that groups documents into 1-minute buckets and counts the total records in each bucket; I store this count as 'num_docs' in my transform destination index. I believe the transform is skipping/not recognizing documents that share the same timestamp, because the aggregated total is always slightly off, with anywhere from a dozen to a few hundred records left uncounted.

Here is my 1-minute transform JSON:

{
  "id": "1_minute_access_logs",
  "authorization": {
    "roles": [
      "superuser"
    ]
  },
  "version": "10.0.0",
  "create_time": 1717458721029,
  "source": {
    "index": [
      "filebeat*"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "1_minute_access_logs"
  },
  "frequency": "30s",
  "sync": {
    "time": {
      "field": "timestamp",
      "delay": "5s"
    }
  },
  "pivot": {
    "group_by": {
      "user_account": {
        "terms": {
          "field": "user_account"
        }
      },
      "timestamp": {
        "date_histogram": {
          "field": "timestamp",
          "calendar_interval": "1m"
        }
      },
      "request_method": {
        "terms": {
          "field": "request_method"
        }
      },
      "response_code": {
        "terms": {
          "field": "response_code"
        }
      }
    },
    "aggregations": {
      "num_docs": {
        "value_count": {
          "field": "@timestamp"
        }
      },
      "timestamp_max": {
        "max": {
          "field": "timestamp"
        }
      },
      "timestamp_min": {
        "min": {
          "field": "timestamp"
        }
      },
      "response_time_avg": {
        "avg": {
          "field": "response_time"
        }
      }
    }
  },
  "description": "1_minute_access_logs",
  "settings": {},
  "retention_policy": {
    "time": {
      "field": "timestamp",
      "max_age": "70m"
    }
  }
}

This is the result I get when querying the 1-minute destination index with Postman:

"aggregations": {
        "total_num_docs": {
            "value": 96581.0
        }
    }

I expected 96,658
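
For reference, that total comes from a sum aggregation over the per-bucket num_docs field on the destination index; roughly this kind of query (the index and field names are taken from the transform above, the exact body may differ slightly):

POST 1_minute_access_logs/_search
{
  "size": 0,
  "aggs": {
    "total_num_docs": {
      "sum": {
        "field": "num_docs"
      }
    }
  }
}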

Please let me know if you need any more information - thank you!

Hey there! This is an interesting challenge. There's nothing built-in that can address this, but there are a few caveated options:

  1. You could hash any part of the document or request you're storing and use that hash as the _id to count against, but that would be costly and doesn't fit your original plan of using the timestamps (see the sketch after this list).
  2. You could store that same hash within the document and count on that field instead, but again, it's costly and not your timestamp-based goal.
  3. I'm not sure whether Envoy can do this, but making the timestamps even more fine-grained, down to nanoseconds, might address it. That would probably add overhead.
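
A rough sketch of what options 1 and 2 could look like with an ingest pipeline, using the fingerprint and set processors. The pipeline name and the field list are just examples (the message field in particular may or may not exist in your documents), so adjust it to whatever uniquely identifies a request:

PUT _ingest/pipeline/access_log_fingerprint
{
  "description": "Hash identifying request fields into a deterministic document _id",
  "processors": [
    {
      "fingerprint": {
        "fields": ["timestamp", "user_account", "request_method", "response_code", "message"],
        "target_field": "fingerprint",
        "method": "SHA-256"
      }
    },
    {
      "set": {
        "field": "_id",
        "copy_from": "fingerprint"
      }
    }
  ]
}

Keep in mind that a data stream only accepts the create op type, so duplicate hashes would be rejected rather than silently merged, and computing the hash on every document is the "costly" part I mentioned.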

The reason there is no easy way to count/increment like this is that it would require a singleton, and we avoid those as much as possible. If the same documents were received by two different coordinating + ingest nodes, neither would have any clue about the other, and that would create problems.

Edit: there are some further questions/thoughts that might help pin down the cause.

Is it possible that the 5s delay is causing the missing docs?
Is this a plain data stream or a time series data stream? Downsampling would help if it's a TSDS.
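
If it does turn out to be a TSDS, downsampling to 1-minute buckets looks roughly like this (the index names here are made up, and the backing index has to be made read-only first):

POST /.ds-filebeat-tsds-backing-index/_downsample/filebeat-downsampled-1m
{
  "fixed_interval": "1m"
}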

I'd definitely recommend trying out some of these options with POST _transform/_preview.
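
For example, a trimmed-down version of your pivot (copied from the transform in your post) can be previewed without writing anything to the destination index:

POST _transform/_preview
{
  "source": {
    "index": ["filebeat*"]
  },
  "pivot": {
    "group_by": {
      "timestamp": {
        "date_histogram": {
          "field": "timestamp",
          "calendar_interval": "1m"
        }
      }
    },
    "aggregations": {
      "num_docs": {
        "value_count": {
          "field": "@timestamp"
        }
      }
    }
  }
}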

I hope this sheds some light, and hope to see some other insights from other people as well.


I have tried using more granular timestamps, but Elasticsearch's default date type only supports millisecond precision. And yes, the solutions mentioned above would be too taxing.

I believe it is a true data stream; I will figure out how to turn it into a time series data stream for downsampling instead.
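
From what I can tell, that would mean an index template roughly like this (a sketch only; the dimension fields are guesses based on what I group by, and the template/data stream names are made up):

PUT _index_template/access-logs-tsds
{
  "index_patterns": ["access-logs-tsds*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["user_account", "request_method"]
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "user_account": { "type": "keyword", "time_series_dimension": true },
        "request_method": { "type": "keyword", "time_series_dimension": true },
        "response_code": { "type": "keyword", "time_series_dimension": true },
        "response_time": { "type": "long", "time_series_metric": "gauge" }
      }
    }
  }
}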

I will also try a larger delay, but I was under the impression that the documents are complete and ready as soon as they show up in the data stream, considering I am shipping them line by line.
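
If the delay turns out to be the issue, I assume I can raise it on the existing transform with the update API, something like this (assuming sync is updatable in my version; otherwise I would recreate the transform with the larger delay):

POST _transform/1_minute_access_logs/_update
{
  "sync": {
    "time": {
      "field": "timestamp",
      "delay": "120s"
    }
  }
}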

Let's see what happens :+1: