Ensure that @timestamp is set correctly; this is hard to get right with manual document pushes. I suggest having a look at an ingest pipeline that sets the timestamp for you. You can find an example in our docs.
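For illustration, a minimal sketch of such a pipeline (the pipeline name set_ingest_timestamp and the target field event.ingested are example choices, not fixed names):

```
PUT _ingest/pipeline/set_ingest_timestamp
{
  "description": "Copy the ingest time into event.ingested",
  "processors": [
    {
      "set": {
        "field": "event.ingested",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
```

You can then attach it per request with `?pipeline=set_ingest_timestamp` or set it as `index.default_pipeline` on the source index.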
You set delay to 1ms - that's unrealistic.
A Lucene index is near-realtime: a document that gets pushed into an index is not immediately searchable. The default refresh_interval of an index is 1s. Did you change the default for test_transform_shlee? Even if so, I doubt 1ms is feasible. Transform sync.delay must be at minimum the configured refresh_interval of the source index. In addition, you should add some room for the ingest pipeline communication. If you don't use an ingest timestamp but e.g. set the timestamp in your application, you have to account for all possible delays - e.g. an additional queue - that can occur between the event happening and the document becoming searchable.
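If you're unsure, you can check the source index's refresh interval like this (an empty response means the 1s default applies), and adjust it if needed:

```
GET test_transform_shlee/_settings?filter_path=*.settings.index.refresh_interval

PUT test_transform_shlee/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}
```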
Thank you for the reply, @Hendrik_Muhs.
I'm glad to see you again.
How did you set @timestamp?
->
My @timestamp is generated by Logstash.
Do I have to use "_ingest.timestamp" in the transform?
You set delay to 1ms - that's unrealistic.
->
In fact, I think this is because I don't clearly understand delay, frequency, pivot, or latest.
I read the official reference docs, but I couldn't understand them for sure.
Given my purpose*, could you please give a simple example?
(* my purpose: I want to implement a transform in which the destination index (dest) is updated every frequency interval.)
In other words: the transform remembers which data is new by its timestamps. It does this using range queries. It is important that the timestamp you configure in sync reflects a real (wall-clock) timestamp and is not just some arbitrary date field.
Example:
Assume checkpoint 1 gets created with timestamp Sat, 01 Apr 2023 19:00:00 +0000.
On the next run checkpoint 2 is created, e.g. with Sat, 01 Apr 2023 19:01:00 +0000. To calculate the update, the transform runs a query for all data between checkpoint 1 and checkpoint 2, i.e. [Sat, 01 Apr 2023 19:00:00 +0000, Sat, 01 Apr 2023 19:01:00 +0000). It's important that all updates fall within that range. It's not a problem if an update arrives after; the next checkpoint will pick it up. But if it arrives before, the transform will miss the document.
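Conceptually, the checkpoint update runs a query like this sketch (assuming the sync field is @timestamp; note gte/lt matching the half-open interval above):

```
GET source-index/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-04-01T19:00:00.000Z",
        "lt":  "2023-04-01T19:01:00.000Z"
      }
    }
  }
}
```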
This is where delay kicks in: when checkpoint 1 is created, the real time is not Sat, 01 Apr 2023 19:00:00 +0000, but e.g. Sat, 01 Apr 2023 19:00:10 +0000; in other words, the checkpoint timestamp is the real clock time minus delay. delay is important because the path from an event happening to the data being stored and searchable is not instant.
In your use case you have to set delay to a value that leaves enough room between the point in time where the document is generated in Logstash and the point where it becomes available for search in Elasticsearch. As said, a default index already requires 1s. Assuming you don't have long delays between Logstash and Elasticsearch, you should be good with e.g. 3s, or 5s if you want to be sure. However, I don't know your setup; you have to find that out yourself.
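For example, a transform synced on the ingest timestamp with a 5s delay could look like the following sketch (index names, the group-by field, and the event.ingested field are placeholders for your setup):

```
PUT _transform/test_transform_shlee
{
  "source": { "index": "logstash-*" },
  "dest":   { "index": "test_transform_dest" },
  "sync": {
    "time": {
      "field": "event.ingested",
      "delay": "5s"
    }
  },
  "frequency": "1m",
  "pivot": {
    "group_by": {
      "user": { "terms": { "field": "user.id" } }
    },
    "aggregations": {
      "event_count": { "value_count": { "field": "event.ingested" } }
    }
  }
}
```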
frequency controls how often the transform checks for new data and potentially creates a new checkpoint. As every checkpoint comes with some additional overhead, a low frequency value (a short interval, i.e. frequent checks) is more expensive. What's your requirement? Updating every second is possible depending on your data ingest rate and cluster setup. I would start with a higher frequency value and test how much faster you can go. frequency can be updated; you don't need to re-create the transform to change it.
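For example, lowering the checkpoint interval later via the update API (transform id taken from the sketch above; depending on your version, updates to a running transform may only take effect after the next checkpoint):

```
POST _transform/test_transform_shlee/_update
{
  "frequency": "10s"
}
```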
I'm sorry, I was late to reply because I was working on another project.
@Hendrik_Muhs
Through the comments you've written so far, I now understand the flow of transform sync.
Thank you!
And I have some additional questions.
Can the checkpoint only be set by comparing against the current time (now())?
└ example)
If 'yesterday's' data is only now being indexed into the source index (checkpoint: a moment ago),
and there is no pipeline that creates a separate now() timestamp,
can't it be reflected in the transform dest?
(Did I understand it correctly?)
Is there a way to adjust the checkpoint manually?
Is it good to set delay to a maximum to prevent data loss?
Question 1: Correct, once the checkpoint has passed that time, data that is added later but carries an older timestamp won't be reflected. That's why we suggest using ingest timestamps: as an ingest timestamp is set at indexing time, you can be sure it won't be too old. You can still index "yesterday's" data if you distinguish between the event timestamp and the ingest timestamp.
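To illustrate with the pipeline sketch from earlier: the document keeps yesterday's event time in @timestamp, while event.ingested gets the current time at indexing, so a transform synced on event.ingested still picks it up (index name is a placeholder):

```
POST logstash-2023.03.31/_doc?pipeline=set_ingest_timestamp
{
  "@timestamp": "2023-03-31T08:00:00Z",
  "message": "yesterday's event, indexed today"
}
```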
Question 2: No. We might add this in the future, but we have no concrete plans at the moment.
Question 3: You should set delay to the worst-case ingest delay that you can imagine in your setup. Using ingest timestamps, this can be e.g. 5s; if you use the event timestamp, plan a larger delay. E.g. if a device sends data every minute, your delay must be at least a minute plus additional room for network communication and ingest.
But note: the transform only starts reading the data once delay has passed. If you set delay to a maximum, the transform wouldn't read the data for a very long time, and therefore your dashboards/consumers of the transform data wouldn't see it. In other words, delay controls the latency of the destination index: the lower the delay, the sooner your data is reflected there.