Transformed index is missing data

We have a transform that aggregates data from a source index, grouped by userId and client_id, and writes the result to a destination index. We've noticed that the destination index is missing data it should contain, even after waiting 10+ minutes (frequency is set to 1m and the sync delay is also 1m). Some data from a given time range is present in the destination, while other data from a similar time range is missing. What could be the reason for the missing data?

Below is the transform definition (index names are renamed for sharing):

{
  "source": {
    "index": "source_index",
    "query": {
      "bool": {
        "must_not": [
          {"term": {"userId": ""}}
        ]
      }
    }
  },
  "dest": {
    "index": "dest_index",
    "pipeline": "add_timestamps_v2"
  },
  "pivot": {
    "group_by": {
      "client_id": {
        "terms": {
          "field": "client_id"
        }
      },
      "user_id_hash": {
        "terms": {
          "field": "userId"
        }
      }
    },
    "aggs": {
      "devices": {
        "terms": {
          "field": "device_id"
        },
        "aggs": {
          "users": {
            "terms": {
              "field": "userId"
            }
          }
        }
      }
      // more aggregations here..
    }
  },
  "frequency": "1m",
  "sync": {
    "time": {
      "field": "updated_at",
      "delay": "60s"
    }
  }
}

Welcome to our community! :smiley:

How are you identifying the missing data?

Hi @warkolm, we have a log in our application that records missing data, and I confirmed that the data is missing from the destination index by searching with client_id and user_id.

Which field are you using for sync and how is the timestamp created?

frequency controls how often the transform checks for new data and how often it retries after a failure.
sync.delay compensates for ingest delays: it defines how long the transform waits for data that arrives late or out of order.

I assume your problem is the setting for sync.delay. You stated that you waited 10+ minutes, but that does not matter. When a checkpoint is created, it takes all the data that is available at that time; data that comes in late is not taken into account. It's like missing a flight while the next ones are all fully booked.

Example: Assume sync.delay is set to 1m. When a checkpoint is created, all data in the range (old_checkpoint, now() - 1m] is queried and processed. If data that falls into that range arrives later, it is part of neither this checkpoint nor the next one, because for the next checkpoint it is considered too old.
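To make the example concrete, here is a small Python sketch of the window arithmetic. It is only a simplified model of the description above, not the transform's actual implementation. The function names, and the idea of tracking a separate "arrival" time (when a document becomes searchable in the source index), are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def checkpoint_windows(start, end, frequency, delay):
    """Return (run_time, lower, upper) per checkpoint; each window is (lower, upper]."""
    windows = []
    lower = start - delay
    run = start + frequency
    while run <= end:
        upper = run - delay          # upper bound is now() - delay
        windows.append((run, lower, upper))
        lower = upper                # next window starts where this one ended
        run += frequency
    return windows

def picked_up(event_time, arrival_time, windows):
    """Return the run time of the checkpoint that processes the document, or None."""
    for run, lower, upper in windows:
        in_window = lower < event_time <= upper
        searchable = arrival_time <= run   # document must exist when the checkpoint runs
        if in_window and searchable:
            return run
    return None

base = datetime(2024, 1, 1, 12, 0, 0)      # arbitrary example start time
end = base + timedelta(minutes=10)
event = base + timedelta(seconds=30)        # value of the sync field (updated_at)

one_min = checkpoint_windows(base, end, timedelta(minutes=1), timedelta(minutes=1))
five_min = checkpoint_windows(base, end, timedelta(minutes=1), timedelta(minutes=5))

# Arrives 10s after its timestamp: processed by the 12:02 checkpoint.
print(picked_up(event, event + timedelta(seconds=10), one_min))
# Arrives 2 minutes after its timestamp with delay=1m: never processed (None).
print(picked_up(event, event + timedelta(minutes=2), one_min))
# Same late document with delay=5m: processed by the 12:06 checkpoint.
print(picked_up(event, event + timedelta(minutes=2), five_min))
```

In this model, the late document is lost with a 1m delay because its window (12:00, 12:01] was already queried at 12:02, before the document arrived at 12:02:30. With a 5m delay the same window is queried at 12:06, so the document is included.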

I suggest you investigate whether sync.delay is set correctly for your use case. I assume you have to increase it.
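One way to investigate is to compare the sync field value (updated_at) of a few documents with the time they actually became available in the source index, and pick a delay larger than the biggest lag. The sketch below assumes you can collect such pairs, for example from your application log; the helper names and the 30s safety margin are hypothetical. The resulting dictionary is meant as a request body for the update transform API (POST _transform/<transform_id>/_update), which, as far as I know, accepts a sync section. Please verify this against the documentation for your version.

```python
from datetime import datetime, timedelta

def ingest_lags(pairs):
    """pairs: list of (updated_at, arrived_at) datetimes; returns the lag per document."""
    return [arrived - updated for updated, arrived in pairs]

def suggested_delay(pairs, margin=timedelta(seconds=30)):
    """Largest observed lag plus a safety margin (the margin is an arbitrary assumption)."""
    return max(ingest_lags(pairs)) + margin

def delay_update_body(field, delay):
    """Build a body that changes only the sync settings of the transform."""
    seconds = int(delay.total_seconds())
    return {"sync": {"time": {"field": field, "delay": f"{seconds}s"}}}

# Hypothetical observations: one document arrived 10s late, another 2 minutes late.
t = datetime(2024, 1, 1, 12, 0, 0)
observed = [
    (t, t + timedelta(seconds=10)),
    (t, t + timedelta(minutes=2)),
]

body = delay_update_body("updated_at", suggested_delay(observed))
print(body)  # {'sync': {'time': {'field': 'updated_at', 'delay': '150s'}}}
```

A larger delay means results show up later in the destination index, so there is a trade-off between freshness and completeness.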
