Transform not syncing

Hi, I'm trying to build my first transform (a latest transform). Maybe I misunderstand, but I expected that once I create and start the transform it would run continuously, so that new documents in the source index are processed and reflected in the destination.

I have created the transform, and when I start it, I see the destination index created with documents reflecting the source index as it existed when the transform started.

New documents in the source index, however, are not being processed. When I search the destination index I see no new documents being created and no existing documents being updated.

When I GET the transform it looks like this:

{
  "count" : 1,
  "transforms" : [
    {
      "id" : "latest-general-event-by-linkid-2021",
      "version" : "7.15.1",
      "create_time" : 1640125319914,
      "source" : {
        "index" : [
          "general-event-*"
        ],
        "query" : {
          "range" : {
            "timestamp" : {
              "gte" : "2021-01-01T06:00:00Z",
              "lt" : "2022-01-01T06:00:00Z"
            }
          }
        }
      },
      "dest" : {
        "index" : "latest-general-event-2021"
      },
      "frequency" : "1m",
      "sync" : {
        "time" : {
          "field" : "timestamp",
          "delay" : "60s"
        }
      },
      "latest" : {
        "unique_key" : [
          "linkId"
        ],
        "sort" : "timestamp"
      },
      "description" : "2021 Latest general event by linkId",
      "settings" : { },
      "retention_policy" : {
        "time" : {
          "field" : "timestamp",
          "max_age" : "550d"
        }
      }
    }
  ]
}

I tried removing the query from the transform, but I get the same result.

Any ideas from this as to what I am doing wrong?

Many thanks - this community has been very helpful to me

You configured timestamp as your field for synchronization. It is important that this field contains a real, valid timestamp. Ensure the timezone of your client and your Elasticsearch server do not mismatch.

You set delay to 60s. This means a data point can arrive up to 60s late. This compensates for all kinds of ingest delays: transfer time, queuing, data accumulation and, last but not least, the refresh interval of the Lucene index you push the data to. 60s is conservative; if you know you push data faster, or if timestamp is set by an ingest processor or by Logstash, you can lower the value. The flip side of delay is that the transform will not query for data from the last 60s. That means if you push a new record now, you have to wait 60s until the transform reads it.
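If you want to check whether the transform is making progress, the transform stats API shows the checkpointing state, including how far behind the source the transform is. A quick check, using the transform id from your output:

GET _transform/latest-general-event-by-linkid-2021/_stats

If operations_behind in the checkpointing section keeps growing but the checkpoint never advances, the sync field is the first thing to inspect.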

Why is transform doing this?

The way it works is called checkpointing, and the reason for checkpointing is updating the index in an efficient way. Transform uses query heuristics to keep the amount of data to be rewritten as small as possible; it does not brute-force re-create everything. You can find a description of how it works in the docs.

I hope this helps.

Hi @iamtheschmitzer

Here is my working transform.

It tracks the last log message from 1000s of Windows hosts.
It keeps the last message / document based on the host.name field.
It is continuous, runs every 10m, is based on @timestamp with a 60s delay, and keeps the data for 7 days. It works great.

{
  "count" : 1,
  "transforms" : [
    {
      "id" : "host-tracker-windows-10m",
      "version" : "7.14.2",
      "create_time" : 1638585746599,
      "source" : {
        "index" : [
          "windows-*"
        ],
        "query" : {
          "match_all" : { }
        }
      },
      "dest" : {
        "index" : "host-tracker-windows-10m"
      },
      "frequency" : "10m",
      "sync" : {
        "time" : {
          "field" : "@timestamp",
          "delay" : "60s"
        }
      },
      "latest" : {
        "unique_key" : [
          "host.name"
        ],
        "sort" : "@timestamp"
      },
      "description" : "host-tracker-windows-10m",
      "settings" : {
        "max_page_search_size" : 500
      },
      "retention_policy" : {
        "time" : {
          "field" : "@timestamp",
          "max_age" : "7d"
        }
      }
    }
  ]
}

I'm using the Dev Console and creating my own timestamps in UTC time. I don't have an @timestamp field in my document. Is this from Filebeat (which I have in production)?

How would I change this to compensate for inaccurate timestamps? Would you make the timestamps earlier in the day to update the index, or later in the day? Or would you change the transform settings?

Filebeat should automatically create an @timestamp field unless you tell it not to or are overwriting it.

Probably worth taking the time to get that fixed.

What does your filebeat.yml look like? What kind of files are you harvesting?

This is a test using the Dev Console, so Filebeat isn't running here.

Okay, understood. You're going to have to create reasonable data for the transform to work on, aligned with how your transform runs... which sounds like what you're trying to do...

(This may be harder than working with real data... :slight_smile: )

So I would do something like this:
Get rid of the query part of your transform (in fact, use mine and just set the frequency to 1m to test).

Create a mapping and some docs.
Put in 3 or 4 docs with a recent timestamp like the below (that is UTC).
Start the transform in continuous mode.

Then wait 3 mins or so, then post those same docs with an updated timestamp and message.
The transform should pick them up, and you should only have the latest docs.
ALSO, you need to create the mapping for the transform destination index (see the sketch after the source mapping below).

DELETE discuss-test

PUT discuss-test/
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "linkId": {
        "type": "keyword"
      },
      "message" : {"type": "text"}
    }
  }
}
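The destination index needs a mapping too, as mentioned above. A minimal sketch, assuming the transform's dest.index is called discuss-test-latest (the name is just an example):

PUT discuss-test-latest
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "linkId": {
        "type": "keyword"
      },
      "message" : {"type": "text"}
    }
  }
}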

POST discuss-test/_doc
{
  "@timestamp" : "2021-12-22T21:21:53Z",
  "linkId" : "1111",
  "message" : "first message"
}

POST discuss-test/_doc
{
  "@timestamp" : "2021-12-22T21:21:53Z",
  "linkId" : "2222",
  "message" : "first message"
}

POST discuss-test/_doc
{
  "@timestamp" : "2021-12-22T21:21:53Z",
  "linkId" : "3333",
  "message" : "first message"
}

POST discuss-test/_doc
{
  "@timestamp" : "2021-12-22T21:21:53Z",
  "linkId" : "4444",
  "message" : "first message"
}
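For completeness, here is a sketch of what the test transform itself could look like, wired to the index above. The transform id and destination name are just placeholders, and the frequency is set to 1m as suggested:

PUT _transform/discuss-test-latest-transform
{
  "source": {
    "index": "discuss-test"
  },
  "dest": {
    "index": "discuss-test-latest"
  },
  "frequency": "1m",
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "latest": {
    "unique_key": [ "linkId" ],
    "sort": "@timestamp"
  }
}

POST _transform/discuss-test-latest-transform/_start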

I recommend this old but gold advent post from 3 years ago: Dec 12th, 2018: [EN][Elasticsearch] Automatically adding a timestamp to documents

It shows how to use a pipeline that adds an ingest timestamp to every document you push. That's the most accurate approach you can use, and it allows you to lower the value for delay. Note that by default Lucene uses a refresh interval of 1s, so you should not set delay below that; use e.g. 2s. Of course you can tweak this further, but I guess that should already meet the recency requirements of most users.
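For reference, a minimal sketch of such a pipeline, using a set processor with the built-in _ingest.timestamp metadata field (the pipeline name and target field are just examples):

PUT _ingest/pipeline/add-ingest-timestamp
{
  "description": "Adds an ingest timestamp to each document",
  "processors": [
    {
      "set": {
        "field": "@timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}

POST discuss-test/_doc?pipeline=add-ingest-timestamp
{
  "linkId": "1111",
  "message": "timestamp added at ingest time"
}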


@Hendrik_Muhs Nice! I know how to do that but I didn't want to write it all up. That's awesome!

@iamtheschmitzer Just set the field to @timestamp if that is what you want!
