Transform is not updating records in the destination index in batch mode

I have created a transform in Kibana with aggregations, and the records were written to the destination index. I then updated the frequency to 10m and started the transform as below.
POST _transform/test-transform/_update
{
"frequency": "2m"
}

POST _transform/test-transform/_start

New records were then added to the source index, but the transform has not updated the aggregations in the destination index.
Can someone please help?

A batch transform is a one-off operation. To update the destination index continuously, you need continuous mode. Continuous mode requires a time field; please have a look at the docs. Another source of information is the e-commerce example: point 4 explains batch vs. continuous mode.

Hope that helps!

Thanks Hendrik for looking into this issue.

I have selected the mode as continuous and selected the time field.

I streamed some files into the index and created a transform on that index. The transform aggregates by group id, and the records show up in the destination index. I selected continuous mode and the time field, with a delay of 60s.
I then streamed some more files into the source index and checked after some time, but the destination index was not updated.

The aggregation I used is a scripted metric. It displays correctly in the transform preview.
In the transform details, the processed documents count is not updated, and neither is the last checkpoint time.

Can you please tell me if I am missing something here?

Hi,

Can you post your config and some sample data? How did you feed the data in? Which time field did you use and what does it contain, e.g. is it an ingest timestamp?

Hi Hendrik,

The data is now being added to the destination index when I use the timestamp field with a delay of 60 seconds.

My requirement is to run the aggregations on the last day's data and write the result to the destination index. If I set the frequency to 24 hours, will the checkpoint run once a day?

As I am grouping by the group name, will the existing data be updated when the group name is the same?

Yes, with frequency you can control how often the transform runs, and a setting of 24h will let it run only once a day. However, you cannot control when. This is a known limitation; better scheduling for use cases like yours is on our backlog. Frequency also controls how often the transform retries after a failure. I therefore do not suggest using frequency for that.

I suggest a query:

    "query": {
        "range" : {
            "timestamp" : {
                "lt" :  "now-5m/d"
            }
        }
    }

now-5m/d resolves to now minus 5 minutes, rounded down to UTC 00:00. It would resolve to 2020-05-07 00:00:00 for today. The 5 minutes is just an example; however, I would subtract some time to compensate for ingest delay. If you omit the subtraction, some data points might not be searchable yet when the transform runs.
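
To make the rounding concrete, here is a small Python sketch that mimics what now-5m/d does in this query. It is only an illustration of the arithmetic, not how Elasticsearch evaluates date math internally, and the function name is made up for this example.

```python
from datetime import datetime, timedelta, timezone


def resolve_now_minus_5m_floor_day(now: datetime) -> datetime:
    """Mimic now-5m/d: subtract 5 minutes, then round down to UTC 00:00."""
    shifted = now.astimezone(timezone.utc) - timedelta(minutes=5)
    return shifted.replace(hour=0, minute=0, second=0, microsecond=0)


mid_morning = datetime(2020, 5, 7, 10, 30, tzinfo=timezone.utc)
just_after_midnight = datetime(2020, 5, 7, 0, 3, tzinfo=timezone.utc)

print(resolve_now_minus_5m_floor_day(mid_morning))          # 2020-05-07 00:00:00+00:00
print(resolve_now_minus_5m_floor_day(just_after_midnight))  # 2020-05-06 00:00:00+00:00
```

Note the second case: until 00:05 UTC the expression still resolves to the start of the previous day, so the new day's data only becomes visible to the transform a few minutes after midnight.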

You still need a continuous transform for your use case, so you would configure it with

    "sync": {
        "time": {
            "field": "timestamp"
        }
    }

The transform will update the destination index whenever it finds new data, but because of the query it will only do it once a day, shortly after midnight. Transform will send a query to the source, collect the group items that have changed between the last and the new checkpoint and only update those (due to the query it will only find updates once a day, the other times it won't find updates and quickly return).

It might be good to have a last_updated field in your pivot using a max aggregation on timestamp. That way you know which group items got updated.
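
Putting the pieces together, the body of the create request (PUT _transform/<transform_id>) could look roughly like the sketch below, written as a Python dict so it can be printed as JSON. The index names and the groupname field are placeholders I made up, and the 60s delay is the value mentioned earlier in this thread; your own aggregations would go next to last_updated.

```python
import json

# Sketch of a transform config; index names and "groupname" are placeholders.
transform_body = {
    "source": {
        "index": "my-source-index",  # placeholder
        # Only consider data up to the start of the current UTC day.
        "query": {"range": {"timestamp": {"lt": "now-5m/d"}}},
    },
    "dest": {"index": "my-dest-index"},  # placeholder
    # Continuous mode: sync on the time field, allowing for ingest delay.
    "sync": {"time": {"field": "timestamp", "delay": "60s"}},
    "pivot": {
        "group_by": {"group": {"terms": {"field": "groupname"}}},
        "aggregations": {
            # Newest source timestamp per group, to see which groups changed.
            "last_updated": {"max": {"field": "timestamp"}},
        },
    },
}

print(json.dumps(transform_body, indent=2))
```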

Thanks a lot Hendrik. This gives a lot of information about transforms.

Can you please advise on the following question?

When the destination index updates after 24 hours, the data for each group id is overwritten with the new data. Can we store the data per day in the destination index for the same group ids?

I want to see the day-by-day trend for each group id.

Simply use a date_histogram as an additional group_by, something like:

    "group_by": {
        "day": {
            "date_histogram": {
                "field": "timestamp",
                "calendar_interval": "1d"
            }
        },
        "group": {
            "terms": {
                "field": "group_field"
            }
        }
    }
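
To illustrate why the extra group_by stops the overwriting, here is a toy Python pivot over made-up sample documents. It is not how transform works internally; the bucket key simply stands in for the identity of a destination document, and the value field is invented for this example.

```python
from collections import defaultdict

# Made-up sample documents; "value" is a hypothetical metric field.
docs = [
    {"timestamp": "2020-05-05T09:00:00Z", "group_field": "a", "value": 1},
    {"timestamp": "2020-05-05T17:00:00Z", "group_field": "a", "value": 2},
    {"timestamp": "2020-05-06T08:00:00Z", "group_field": "a", "value": 5},
    {"timestamp": "2020-05-06T11:00:00Z", "group_field": "b", "value": 3},
]


def pivot(docs, by_day):
    """Sum value per bucket; one bucket corresponds to one destination document."""
    buckets = defaultdict(int)
    for doc in docs:
        day = doc["timestamp"][:10]  # UTC calendar day, like a 1d calendar_interval
        key = (day, doc["group_field"]) if by_day else (doc["group_field"],)
        buckets[key] += doc["value"]
    return dict(buckets)


print(pivot(docs, by_day=False))
# {('a',): 8, ('b',): 3}
print(pivot(docs, by_day=True))
# {('2020-05-05', 'a'): 3, ('2020-05-06', 'a'): 5, ('2020-05-06', 'b'): 3}
```

Grouping by group alone gives one document per group that keeps being updated, while grouping by day and group gives one document per group per day.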

You might also want to look into rollup, which is built around the compaction use case. Whether to use transform or rollup depends on your needs: rollup provides rollup search, while transform gives you more flexibility in what you can do with the data.

Thanks a lot Hendrik,

adding one more group_by on day should solve this issue.

I will try it and update you with the result.

Thanks Hendrik,

It is working perfectly fine.

Hi Hendrik,

Could you please clarify the points below?

  1. frequency determines when the checkpoint is taken on the source index.
  2. delay in sync determines when data from the checkpoint is ingested into the destination index.

I think our docs are clear:

The interval between checks for changes in the source indices when the transform is running
continuously. Also determines the retry interval in the event of transient failures while the
transform is searching or indexing. The minimum value is `1s` and the maximum is `1h` . The
default value is `1m` .

In the implementation it means: transform runs a query against the source index every minute, if frequency is set to 1m. If and only if transform finds changes, it creates a new checkpoint and runs the transform to update the destination according to the changes in source.
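
As a rough mental model only (a toy simulation, not the real implementation), the check loop can be pictured like this in Python. The function name and the times in seconds are made up for illustration.

```python
def simulate_checks(searchable_at, frequency, run_for):
    """Return the check times at which a new checkpoint would be created.

    searchable_at: times (seconds) at which new source documents become searchable
    frequency:     seconds between checks for changes
    run_for:       total seconds to simulate
    """
    checkpoints = []
    last_checkpoint = 0
    for now in range(frequency, run_for + 1, frequency):
        # Look for source changes since the last checkpoint.
        has_changes = any(last_checkpoint < t <= now for t in searchable_at)
        if has_changes:
            # Only a check that finds changes creates a checkpoint.
            checkpoints.append(now)
            last_checkpoint = now
    return checkpoints


print(simulate_checks([30, 45, 250], frequency=60, run_for=360))  # [60, 300]
```

Six checks run in this example, but only the two that find new data create a checkpoint.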

No, delay compensates for ingest delay in the source. Assume you have a data shipper that sends metrics from your servers with a timestamp attached. First, time passes for the network transfer (a data shipper might also send bulk requests, say, every minute). Then the data goes through some ingest pipeline, e.g. logstash, kafka, or ingest. Finally the data is persisted in an index, but some more time (refresh_interval) might pass before it is searchable, which means before it can be retrieved. If you sum up all of the above, you know how much time passes between the application sending the metric and the data being available in the index, i.e. potentially readable for transform. This time can be different for every data point; now take the worst case of it: that's the delay for this use case.
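
As a back-of-the-envelope sketch in Python, with invented per-document latencies for the stages described above, the worst case could be estimated like this:

```python
# Hypothetical latencies in seconds per data point; all numbers are made up.
observed = [
    {"network_and_bulk": 20, "ingest_pipeline": 5, "refresh": 1},
    {"network_and_bulk": 55, "ingest_pipeline": 8, "refresh": 1},
    {"network_and_bulk": 10, "ingest_pipeline": 30, "refresh": 1},
]


def worst_case_delay(samples):
    """Sum the stages for each data point and take the slowest one."""
    return max(sum(stages.values()) for stages in samples)


delay_seconds = worst_case_delay(observed)
print(f'"delay": "{delay_seconds}s"')  # "delay": "64s"
```

In practice you would probably add some headroom on top of the measured worst case.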
