The sync field is used together with the high water mark and the delay to determine which records in the input index the calculation covers. Delay may need to be >= frequency to ensure it processes all records. It may be querying for sync > (now - delay) once per frequency interval, which in your case would be sync > (now - 60s) every hour, and that would miss 59 minutes' worth of data.
What I think happens when it runs is that it first groups all of the input records. The date histogram on @timestamp works by whole calendar days, so it works out that all of the input records fall on the same day. Then it queries the index for all records that occurred on that day, groups them, and writes the resulting documents to the output index, creating new versions of (or maybe updating) those documents as it goes.
If the frequency is 1 hour, then when the transform runs at 10 it considers only the records newer than 9:59.
Will it lose the data from 9 to 9:59? Is my assumption correct?
When it runs at 10, 11, or 12, it should also consider earlier records from the source index. Will it pick up the missing data at that point?
If the sync field is not @timestamp, what is the formula for the delay value (say the sync field is another time field that is 30 minutes behind the present time)?
delay >= frequency + 30 min?
frequency defines how often the transform looks for new data and, in case of a failure, how quickly it retries. This setting only defines scheduling; it has no impact on how the data is transformed. In other words, using different frequencies does not lead to different data.
sync and its sub-setting delay do not impact how data is transformed either, assuming you set them up correctly. delay defines the ingest delay; it answers the question "When is it safe to query for data?". The time used for the timestamp field can lag by different amounts. For example, if you feed the timestamp from an external system, you might batch the data and send it every 5 minutes. In that case delay must be 5 minutes plus whatever it takes to transfer the data over the network and index it in Elasticsearch (refresh_interval).
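As a rough illustration of the arithmetic above, here is a minimal Python sketch. The function name `minimum_delay` and the 10-second transfer time are made up for this example; the 1-second refresh is only an assumed value for refresh_interval, so substitute whatever your index actually uses.

```python
from datetime import timedelta

def minimum_delay(batch_interval, transfer_time, refresh_interval):
    """Lower bound for the sync delay: a document only becomes searchable
    after it has been batched, transferred, and refreshed."""
    return batch_interval + transfer_time + refresh_interval

delay = minimum_delay(
    batch_interval=timedelta(minutes=5),     # external system sends every 5 minutes
    transfer_time=timedelta(seconds=10),     # assumed network + indexing time
    refresh_interval=timedelta(seconds=1),   # assumed refresh_interval of the index
)
print(f"delay should be at least {int(delay.total_seconds())}s")
```

In practice you would probably round this up generously, since a delay that is too large only postpones results, while one that is too small can miss data.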
A continuous transform works in 2 steps:
1. Identify the data points that need to be updated.
2. Re-create the data points that need to be updated.
If you configure sync with a delay that is too low, step 1 might miss data points that need to be updated.
Regarding your example:
When the transform runs at 10, assuming its last run was at 9, it will:
1. Query the source between 9 and 10 and, for example, identify that a and b have changed (but not c and d).
2. Query the source up to 10, filtered by a and b, and update the documents for a and b.
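The two steps in this example can be sketched in plain Python over an in-memory list. This is only a toy model, not how the transform is implemented: the documents, the sum aggregation, and the function names are invented for illustration, and the delay is ignored to keep it short.

```python
from collections import defaultdict

# Hypothetical source documents: (group key, timestamp in hours, value)
SOURCE = [
    ("a", 8.5, 1),
    ("b", 8.7, 2),
    ("c", 8.2, 5),
    ("a", 9.3, 10),
    ("b", 9.6, 20),
]

def changed_keys(docs, start, end):
    """Step 1: find the group keys that received data in [start, end)."""
    return {key for key, ts, _ in docs if start <= ts < end}

def recompute(docs, keys, end):
    """Step 2: re-aggregate ALL data before `end`, but only for the changed keys."""
    totals = defaultdict(int)
    for key, ts, value in docs:
        if key in keys and ts < end:
            totals[key] += value
    return dict(totals)

keys = changed_keys(SOURCE, 9, 10)    # only a and b changed between 9 and 10
updated = recompute(SOURCE, keys, 10)  # a and b are rebuilt from their full history
print(sorted(keys), updated)
```

Note that the rebuilt totals for a and b include the documents from before 9, while c is left untouched because nothing new arrived for it.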
Note:
In step 2 it queries all data, not only the data from the last interval.
a and b can be terms, but also ranges if you are grouping by date_histogram.
If the query runs at 10, it does not query "lower than 10" but "lower than 10 - delay". Accordingly, the range in step 1 is 9 - delay <= x < 10 - delay.
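To see why an hourly frequency with a 60s delay does not skip data, here is a small Python sketch of the step 1 range for consecutive runs. It is a simplification under stated assumptions: it treats the previous run time as the last checkpoint, and the function name `change_window` and the dates are made up for this example.

```python
from datetime import datetime, timedelta

def change_window(last_run, this_run, delay):
    """Range checked in step 1: last_run - delay <= x < this_run - delay."""
    return last_run - delay, this_run - delay

frequency = timedelta(hours=1)
delay = timedelta(seconds=60)

# Hypothetical runs at 9:00, 10:00, 11:00 and 12:00
runs = [datetime(2024, 1, 1, 9) + i * frequency for i in range(4)]
windows = [change_window(prev, cur, delay) for prev, cur in zip(runs, runs[1:])]

for start, end in windows:
    print(start.time(), "<= x <", end.time())

# Each window starts exactly where the previous one ended, so nothing is skipped.
gaps = [nxt_start - end for (_, end), (nxt_start, _) in zip(windows, windows[1:])]
print("gaps between windows:", gaps)
```

The delay only shifts every window back by the same amount; it does not shrink the window to the last 60 seconds.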
So again, neither sync nor frequency affects how data is transformed, only how and when it is updated. The transformation is defined in your group_by. If you are not getting the expected results, you might have a problem with how sync is set up or a problem in your data.
I think I can better help you if you post your config and explain what you expect.