Ensure that @timestamp is set correctly; this is hard to get right with manual document pushes. I suggest having a look at an ingest pipeline that sets the timestamp for you. You can find an example in our docs.
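For illustration, a minimal sketch of such a pipeline (the pipeline name set_ingest_timestamp and the target field event.ingested are example choices, not fixed names):

```
PUT _ingest/pipeline/set_ingest_timestamp
{
  "description": "Copy the ingest time into event.ingested",
  "processors": [
    {
      "set": {
        "field": "event.ingested",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
```

You can then attach it per request with `?pipeline=set_ingest_timestamp` or set it as `index.default_pipeline` on the source index.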
You set delay to 1ms - that's unrealistic.
A Lucene index is near-realtime: a document that gets pushed into an index is not immediately searchable. The default refresh_interval of an index is 1s. Did you change the default for test_transform_shlee? Even if so, I doubt 1ms is feasible. Transform sync.delay must be at minimum the configured refresh_interval of the source index. In addition, you should add some room for the ingest pipeline communication. If you don't use an ingest timestamp but e.g. set the timestamp in your application, you have to account for all possible delays - e.g. an additional queue - that can occur between the event happening and the document becoming searchable.
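If you're unsure, you can check the source index's refresh interval like this (an empty response means the 1s default applies), and adjust it if needed:

```
GET test_transform_shlee/_settings?filter_path=*.settings.index.refresh_interval

PUT test_transform_shlee/_settings
{
  "index": {
    "refresh_interval": "1s"
  }
}
```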
Thank you for the reply, @Hendrik_Muhs.
I'm glad to see you again.
How did you set @timestamp?
->
My @timestamp is generated by Logstash.
Do I have to use "_ingest.timestamp" in the transform?
You set delay to 1ms - that's unrealistic.
->
In fact, I think this is because I don't clearly understand delay, frequency, pivot, or latest.
I read the official reference docs, but I couldn't understand them for sure.
Given my purpose*, could you please give a simple example?
(* my purpose: I want to implement a transform in which the destination index (dest) is updated every frequency interval.)
In other words: the transform remembers which data is new by its timestamps. It does this using range queries. It is important that the timestamp you configure in sync reflects a real (wall-clock) timestamp and is not just some arbitrary date field.
Example:
Assume checkpoint 1 gets created with timestamp Sat, 01 Apr 2023 19:00:00 +0000.
On the next run checkpoint 2 is created, e.g. with Sat, 01 Apr 2023 19:01:00 +0000. To calculate the update, the transform runs a query for all data between checkpoint 1 and checkpoint 2, i.e. [Sat, 01 Apr 2023 19:00:00 +0000, Sat, 01 Apr 2023 19:01:00 +0000). It's important that all updates fall within that range. It's not a problem if an update arrives after; the next checkpoint will pick it up. But if it arrives before, the transform will miss the document.
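Conceptually, the checkpoint update runs a query like this sketch (assuming the sync field is @timestamp; note gte/lt matching the half-open interval above):

```
GET source-index/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-04-01T19:00:00.000Z",
        "lt":  "2023-04-01T19:01:00.000Z"
      }
    }
  }
}
```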
This is where delay kicks in: when checkpoint 1 is created, the real time is not Sat, 01 Apr 2023 19:00:00 +0000, but e.g. Sat, 01 Apr 2023 19:00:10 +0000; in other words, the checkpoint timestamp is the real clock time minus delay. delay is important because the path from an event happening to the data being stored and searchable is not instant.
In your use case you have to set delay to a value that leaves enough room between the point in time where the document is generated in Logstash and the point where it becomes available for search in Elasticsearch. As said, a default index already requires 1s. Assuming you don't have long delays between Logstash and Elasticsearch, you should be good with e.g. 3s, or 5s if you want to be sure. However, I don't know your setup; you have to find that out yourself.
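For example, a transform synced on the ingest timestamp with a 5s delay could look like the following sketch (index names, the group-by field, and the event.ingested field are placeholders for your setup):

```
PUT _transform/test_transform_shlee
{
  "source": { "index": "logstash-*" },
  "dest":   { "index": "test_transform_dest" },
  "sync": {
    "time": {
      "field": "event.ingested",
      "delay": "5s"
    }
  },
  "frequency": "1m",
  "pivot": {
    "group_by": {
      "user": { "terms": { "field": "user.id" } }
    },
    "aggregations": {
      "event_count": { "value_count": { "field": "event.ingested" } }
    }
  }
}
```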
frequency controls how often the transform checks for new data and potentially creates a new checkpoint. As every checkpoint comes with some additional overhead, a low frequency value (a short interval, i.e. frequent checks) is more expensive. What's your requirement? Updating every second is possible depending on your data ingest rate and cluster setup. I would start with a higher frequency value and test how much faster you can go. frequency can be updated; you don't need to re-create the transform to change it.
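For example, lowering the checkpoint interval later via the update API (transform id taken from the sketch above; depending on your version, updates to a running transform may only take effect after the next checkpoint):

```
POST _transform/test_transform_shlee/_update
{
  "frequency": "10s"
}
```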
I'm sorry, I was late to reply because I was working on another project.
@Hendrik_Muhs
Through the comments you've written so far, I now understand the flow of transform sync.
Thank you!
And I have some additional questions.
Can the checkpoint only be set by comparing against the current time (now())?
└ example)
If 'yesterday's' data is only now being indexed into the source index (checkpoint: a moment ago),
and there is no pipeline that creates a separate now() timestamp,
can't it be reflected in the transform dest?
(Did I understand it correctly?)
Is there a way to adjust the checkpoint manually?
Is it good to set delay to a maximum to prevent data loss?
Question 1: Correct, once the checkpoint has passed that time, data that is added later but carries an older timestamp won't be reflected. That's why we suggest using ingest timestamps: as an ingest timestamp is set at indexing time, you can be sure it won't be too old. You can still index "yesterday's" data if you distinguish between the event timestamp and the ingest timestamp.
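To illustrate with the pipeline sketch from earlier: the document keeps yesterday's event time in @timestamp, while event.ingested gets the current time at indexing, so a transform synced on event.ingested still picks it up (index name is a placeholder):

```
POST logstash-2023.03.31/_doc?pipeline=set_ingest_timestamp
{
  "@timestamp": "2023-03-31T08:00:00Z",
  "message": "yesterday's event, indexed today"
}
```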
Question 2: No. We might add this in the future, but we have no concrete plans at the moment.
Question 3: You should set delay to the worst-case ingest delay that you can imagine in your setup. Using ingest timestamps, this can be e.g. 5s; if you use the event timestamp, plan a larger delay. E.g. if a device sends data every minute, your delay must be at least a minute plus additional room for network communication and ingest.
But note: the transform only starts reading the data once delay has passed. If you set delay to a maximum, the transform wouldn't read the data for a very long time, and therefore your dashboards/consumers of the transform data wouldn't see it. In other words, delay controls the latency of the destination index: the lower the delay, the sooner your data is reflected there.