'Latest' Transform job not refreshing all docs in the destination index

Hi, I'm new to Transforms.

I have a PowerShell script that runs every hour, gathers VMware virtual machine capacity metrics (CPU, memory and such) from vCenter, and stores them in a monthly index: virtualisation-vm-yyyy-MM

I've created a simple continuous 'latest' transform that sends the latest records from the index pattern virtualisation-vm-* to a destination index called virtualisation-latest-vm-vsphere. I thought it was working as expected, but after a few days I noticed that some VMs in the destination index have not updated since the transform was created, even though new data has been added to the source index for those VMs every hour since then.

I've recreated the transform several times with different settings (frequency, delay, etc.) but it made no difference. How do I debug this? There are no errors or warnings in the Messages section of the transform.

Again, this does not affect all docs. The majority update, but about 10% don't, and it isn't always the same ones that fail to update when I recreate the transform.

Below are my settings:

(3 node cluster)

value={
  "id": "virtualisation-latest-vm-vsphere",
  "version": "7.16.2",
  "create_time": 1653387161554,
  "source": {
    "index": [
      "virtualisation-vm-*"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "virtualisation-latest-vm-vsphere"
  },
  "frequency": "60m",
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "latest": {
    "unique_key": [
      "vm.keyword"
    ],
    "sort": "@timestamp"
  },
  "description": "Only the latest information from the virtualisation capacity data.",
  "settings": {
    "max_page_search_size": 500
  },
  "retention_policy": {
    "time": {
      "field": "@timestamp",
      "max_age": "30d"
    }
  }
}

From the information so far, the most likely explanation is a divergence between @timestamp and the time of ingest: if a document only becomes searchable later than its @timestamp plus sync.time.delay, the transform can miss it. (I'm not sure how long a sync.time.delay you experimented with.)
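
For reference, here is a minimal sketch of lengthening the delay on the existing transform through the update transform API. The 10m value is purely illustrative, not a recommendation; it would need to exceed the worst-case gap between a document's @timestamp and the moment it becomes searchable.

```console
# Assumption: "10m" is a placeholder value chosen for illustration only.
# The transform ID and sync field are taken from the settings posted above.
POST _transform/virtualisation-latest-vm-vsphere/_update
{
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "10m"
    }
  }
}
```

A longer delay makes the destination index lag further behind the source, so it is a trade-off rather than a free fix.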

As a best practice, the sync time field should be the time of ingest. This is the best way for a transform to identify what has changed since the last time it checked. It can be set using an ingest processor, something along the lines of:

PUT _ingest/pipeline/set_ingest_time
{
  "description": "Adds ingest timestamps",
  "processors": [
    {
      "set": {
        "field": "_source.@timestamp_ingest",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
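
Once documents carry the ingest timestamp, a transform could sync on that field while still sorting on @timestamp to pick the latest record per VM. The following is only a sketch: the transform ID and destination index are hypothetical names, and it assumes @timestamp_ingest is mapped as a date in the source indices.

```console
# Assumption: the "-ingest" transform ID and destination index are made-up names.
# Assumption: @timestamp_ingest is populated by the set_ingest_time pipeline
# and mapped as a date.
PUT _transform/virtualisation-latest-vm-vsphere-ingest
{
  "source": {
    "index": ["virtualisation-vm-*"]
  },
  "dest": {
    "index": "virtualisation-latest-vm-vsphere-ingest"
  },
  "frequency": "60m",
  "sync": {
    "time": {
      "field": "@timestamp_ingest",
      "delay": "60s"
    }
  },
  "latest": {
    "unique_key": ["vm.keyword"],
    "sort": "@timestamp"
  }
}
```

Documents indexed before the pipeline was attached will not have @timestamp_ingest, so they would not match a time filter on that field.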

The best way to see whether @timestamp and ingest time diverge is to plot document counts over time for both fields and compare them.
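
If a chart is not to hand, roughly the same comparison can be made with two date histograms in a single search. This is a sketch that assumes the ingest field is named @timestamp_ingest (as in the pipeline above) and is mapped as a date; the one-hour bucket size simply matches the hourly collection interval.

```console
# Assumption: @timestamp_ingest exists and is mapped as a date.
# Buckets whose counts differ between the two histograms suggest
# documents arriving later than their @timestamp.
GET virtualisation-vm-*/_search
{
  "size": 0,
  "aggs": {
    "by_event_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      }
    },
    "by_ingest_time": {
      "date_histogram": {
        "field": "@timestamp_ingest",
        "fixed_interval": "1h"
      }
    }
  }
}
```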

There could also be errors in the Elasticsearch logs relating to writes to the virtualisation-latest-vm-vsphere index.
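
Alongside the logs, the transform stats API is another place to look. As a sketch, the request below returns the transform's state, counters such as index_failures and search_failures, and checkpointing details.

```console
# The transform ID is the one from the settings posted above.
GET _transform/virtualisation-latest-vm-vsphere/_stats
```

Comparing the time_upper_bound of the last checkpoint with the @timestamp of a document that failed to update may show whether that document fell outside the window the transform searched.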

Hope this helps
Sophie

Thanks Sophie

Can I have more info on how to set the ingest timestamp for my index pattern, or could you point me to a help/tutorial page? I've never used pipeline processors before.

I've managed to create the pipeline using Sophie's example, but I can't see anywhere to tell the pipeline to apply that ingest timestamp field only to my original index pattern "virtualisation-vm-*".

How does it know which indices to apply the pipeline to? Does it just apply to all indices?

There are a few examples here that explain how to use a pipeline when indexing documents into the virtualisation-vm-* indices, which are the transform's source: Ingest pipelines | Elasticsearch Guide [8.2] | Elastic
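
In short, a pipeline is not applied to any index on its own; it only runs when an indexing request or an index setting refers to it. A minimal sketch of both options follows, assuming the pipeline is named set_ingest_time as above. The concrete index name and document values in the second request are hypothetical.

```console
# Option 1: make the pipeline the default for the existing source indices.
PUT virtualisation-vm-*/_settings
{
  "index.default_pipeline": "set_ingest_time"
}

# Option 2: name the pipeline on an individual indexing request.
# Assumption: the index name and field values below are placeholders.
POST virtualisation-vm-2022-05/_doc?pipeline=set_ingest_time
{
  "@timestamp": "2022-05-24T10:00:00Z",
  "vm": "example-vm-01"
}
```

Option 1 only covers indices that already exist. Since a new index is created each month, the same setting would also need to reach new indices, for example through an index template that matches virtualisation-vm-*.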

Thank you Sophie. I've further experimented with the sync delay value and it seems to be working as expected now
