[HELP] Transform not accounting for documents with the same timestamp

Hi, I am new to Elasticsearch. I have a data stream that uses a custom timestamp field produced by Envoy, fed through Filebeat to Logstash (with grok), and then to Elasticsearch. The use case is that thousands of requests come into a log, and many of these documents/requests share the same timestamp down to the millisecond.

I am trying to make a 1-minute continuous aggregation/transform that groups documents into 1-minute buckets and counts the total records in each bucket; I store this count as 'num_docs' in my transform destination index. I believe the transform is skipping/not recognizing documents that share the same timestamp, because the aggregated total is always slightly off, with anywhere from a dozen to a few hundred records left uncounted.

Here is my 1-minute transform JSON:

{
  "id": "1_minute_access_logs",
  "authorization": {
    "roles": [
      "superuser"
    ]
  },
  "version": "10.0.0",
  "create_time": 1717458721029,
  "source": {
    "index": [
      "filebeat*"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "1_minute_access_logs"
  },
  "frequency": "30s",
  "sync": {
    "time": {
      "field": "timestamp",
      "delay": "5s"
    }
  },
  "pivot": {
    "group_by": {
      "user_account": {
        "terms": {
          "field": "user_account"
        }
      },
      "timestamp": {
        "date_histogram": {
          "field": "timestamp",
          "calendar_interval": "1m"
        }
      },
      "request_method": {
        "terms": {
          "field": "request_method"
        }
      },
      "response_code": {
        "terms": {
          "field": "response_code"
        }
      }
    },
    "aggregations": {
      "num_docs": {
        "value_count": {
          "field": "@timestamp"
        }
      },
      "timestamp_max": {
        "max": {
          "field": "timestamp"
        }
      },
      "timestamp_min": {
        "min": {
          "field": "timestamp"
        }
      },
      "response_time_avg": {
        "avg": {
          "field": "response_time"
        }
      }
    }
  },
  "description": "1_minute_access_logs",
  "settings": {},
  "retention_policy": {
    "time": {
      "field": "timestamp",
      "max_age": "70m"
    }
  }
}

This is the result I get when querying the 1-minute destination index with Postman:

"aggregations": {
        "total_num_docs": {
            "value": 96581.0
        }
    }

I expected 96,658
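
For reference, that total comes from a sum aggregation over the per-bucket num_docs field on the destination index; roughly this kind of query (the index and field names are taken from the transform above, the exact body may differ slightly):

POST 1_minute_access_logs/_search
{
  "size": 0,
  "aggs": {
    "total_num_docs": {
      "sum": {
        "field": "num_docs"
      }
    }
  }
}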

Please let me know if you need any more information - thank you!

Hey there! This is an interesting challenge. There's nothing built-in that can address this, but there are a few caveated options:

  1. You could hash any part of the document or request you're storing and use that hash as the _id to count against, but that would be costly and doesn't fit your original plan of using the timestamps (see the sketch after this list).
  2. You could store that same hash within the document and count on that field instead, but again, it's costly and not your timestamp-based goal.
  3. I'm not sure whether Envoy can do this, but making the timestamps even more fine-grained, down to nanoseconds, might address it. That would probably add overhead.
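
A rough sketch of what options 1 and 2 could look like with an ingest pipeline, using the fingerprint and set processors. The pipeline name and the field list are just examples (the message field in particular may or may not exist in your documents), so adjust it to whatever uniquely identifies a request:

PUT _ingest/pipeline/access_log_fingerprint
{
  "description": "Hash identifying request fields into a deterministic document _id",
  "processors": [
    {
      "fingerprint": {
        "fields": ["timestamp", "user_account", "request_method", "response_code", "message"],
        "target_field": "fingerprint",
        "method": "SHA-256"
      }
    },
    {
      "set": {
        "field": "_id",
        "copy_from": "fingerprint"
      }
    }
  ]
}

Keep in mind that a data stream only accepts the create op type, so duplicate hashes would be rejected rather than silently merged, and computing the hash on every document is the "costly" part I mentioned.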

The reason there is no easy way to count/increment like this is that it would require a singleton, and we avoid those as much as possible. If the same documents were received by two different coordinating + ingest nodes, neither would have any clue about the other, and that would create problems.

Edit: there are some further questions/thoughts that might help pin down the cause.

Is it possible that the 5s delay is causing the missing docs?
Is this a plain data stream or a time series data stream? Downsampling would help if it's a TSDS.
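
If it does turn out to be a TSDS, downsampling to 1-minute buckets looks roughly like this (the index names here are made up, and the backing index has to be made read-only first):

POST /.ds-filebeat-tsds-backing-index/_downsample/filebeat-downsampled-1m
{
  "fixed_interval": "1m"
}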

I'd definitely recommend trying out some of these options with POST _transform/_preview.
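
For example, a trimmed-down version of your pivot (copied from the transform in your post) can be previewed without writing anything to the destination index:

POST _transform/_preview
{
  "source": {
    "index": ["filebeat*"]
  },
  "pivot": {
    "group_by": {
      "timestamp": {
        "date_histogram": {
          "field": "timestamp",
          "calendar_interval": "1m"
        }
      }
    },
    "aggregations": {
      "num_docs": {
        "value_count": {
          "field": "@timestamp"
        }
      }
    }
  }
}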

I hope this sheds some light, and hope to see some other insights from other people as well.


I have tried using more granular timestamps, but Elasticsearch's default date type only supports millisecond precision. And yes, the solutions mentioned above would be too taxing.

I believe it is a true data stream; I will figure out how to turn it into a time series data stream for downsampling instead.
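
From what I can tell, that would mean an index template roughly like this (a sketch only; the dimension fields are guesses based on what I group by, and the template/data stream names are made up):

PUT _index_template/access-logs-tsds
{
  "index_patterns": ["access-logs-tsds*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["user_account", "request_method"]
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "user_account": { "type": "keyword", "time_series_dimension": true },
        "request_method": { "type": "keyword", "time_series_dimension": true },
        "response_code": { "type": "keyword", "time_series_dimension": true },
        "response_time": { "type": "long", "time_series_metric": "gauge" }
      }
    }
  }
}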

I will also try a larger delay, but I was under the impression that the documents are complete and ready as soon as they show up in the data stream, considering I am shipping them line by line.
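
If the delay turns out to be the issue, I assume I can raise it on the existing transform with the update API, something like this (assuming sync is updatable in my version; otherwise I would recreate the transform with the larger delay):

POST _transform/1_minute_access_logs/_update
{
  "sync": {
    "time": {
      "field": "timestamp",
      "delay": "120s"
    }
  }
}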

Let's see what happens :+1: