Transform - Continuous mode for more than 1 index?

Hi all.

I have been searching the web and the Elastic forum for a solution, but it doesn't seem like other people are having this problem with the Transform function.

I'm using Transform to create summary tables to optimize queries. It works when I create a new transform. However, every day a new index with the same index pattern is added to Elasticsearch. I know there is a capability to perform the transformation continuously by checking the source indices for changes. How should I configure it to make this work?

For example, the following are the indices; each day, a new index is added:
snort-2020-06-01
snort-2020-06-02
snort-2020-06-03
....
snort-2020-06-30 (newly added)

When creating the transform, I chose the index pattern "snort*".

I want the "continuous transform function" to pick up the new indices matching the pattern and update the transformed index.

Appreciate much!

Referring to the above image, the source index is snort*.

Your approach looks fine to me. You can use index patterns with wildcards as the source of a transform; an alias that points to multiple indices should work as well.
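
For illustration, a minimal sketch of the alias approach (the alias name snort-all is just an example; note that indices created later would need the alias added as well, e.g. via an index template):

POST _aliases
{
  "actions" : [
    {
      "add" : {
        "index" : "snort-2020-06-*",
        "alias" : "snort-all"
      }
    }
  ]
}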

If you are using a date_histogram in your group_by, it's advised to use at least 7.7, which introduced an optimization for this case.
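
For reference, a date_histogram group_by in a pivot looks roughly like this (a sketch, not taken from the configuration in this thread; the interval is an example):

"group_by" : {
  "date" : {
    "date_histogram" : {
      "field" : "@timestamp",
      "calendar_interval" : "1d"
    }
  }
}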

If you need specific help, please post your job configuration, or at least the parts you have questions about.

Hey @Hendrik_Muhs

Thanks for your reply, but my new indices are not being picked up by the transformed_snort job. Currently I'm using just one node for Elasticsearch, but I suppose that should not affect the transform job?

The following is my job configuration.

In part 2 of the configuration, I suppose the date field '@timestamp' refers to the timestamp in the index pattern (snort*)?
As per my requirement, I do not have any @timestamp in my transformed_snort index, only @timestamp.max and @timestamp.min from the aggregations.

When using the Kibana UI you are using Kibana index patterns. I am not sure why it's not picked up. Can you check the source in the dev console:

GET _transform/{name}

Maybe the Kibana index pattern is not set up correctly, but it looks OK to me. How do you know it's not picking up new indices? Do visualizations work using the same pattern?

For a continuous transform you specify the timestamp field in the source index; the suggested @timestamp looks OK to me.

{
  "count" : 1,
  "transforms" : [
    {
      "id" : "transformed_snort",
      "source" : {
        "index" : [
          "snort*"
        ],
        "query" : {
          "match_all" : { }
        }
      },
      "dest" : {
        "index" : "transformed_snort"
      },
      "sync" : {
        "time" : {
          "field" : "@timestamp",
          "delay" : "60s"
        }
      },
      "pivot" : {
        "group_by" : {
          "ip.dst" : {
            "terms" : {
              "field" : "ip.dst"
            }
          },
          "ip.src" : {
            "terms" : {
              "field" : "ip.src"
            }
          },
          "port.dst" : {
            "terms" : {
              "field" : "port.dst"
            }
          },
          "port.src" : {
            "terms" : {
              "field" : "port.src"
            }
          },
          "frame.protocols" : {
            "terms" : {
              "field" : "frame.protocols"
            }
          }
        },
        "aggregations" : {
          "timestamp_max" : {
            "max" : {
              "field" : "@timestamp"
            }
          },
          "timestamp_min" : {
            "min" : {
              "field" : "@timestamp"
            }
          }
        }
      },
      "description" : "transformed_snort",
      "version" : "7.7.0",
      "create_time" : 1593585286020
    }
  ]
}

Source seems correct here as well.

Every day a new index is created. I found that the "doc count" and "size" of transformed_snort did not increase, and suspected something went wrong with the continuous transform.

Visualization of the index pattern (snort*) is working well for me.
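
For reference, one way to check the doc count and size in the dev console (a sketch using the _cat API):

GET _cat/indices/transformed_snort?v&h=index,docs.count,store.size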

OK, thanks. I do not see any problem in your config. The next step towards debugging your case is to take a look at the stats:

GET _transform/{name}/_stats

This is also available in the UI if you click on the little arrow left of the transform name.

{
  "count" : 1,
  "transforms" : [
    {
      "id" : "transformed_snort",
      "state" : "started",
      "node" : {
        "id" : "2_u7iCyARIG-RiPdv-PyHg",
        "name" : "instance-2",
        "ephemeral_id" : "A3as6ogqTLSrAee0dXcaNQ",
        "transport_address" : "127.0.0.1:9300",
        "attributes" : { }
      },
      "stats" : {
        "pages_processed" : 3130,
        "documents_processed" : 11043063,
        "documents_indexed" : 1563650,
        "trigger_count" : 230,
        "index_time_in_ms" : 99674,
        "index_total" : 3128,
        "index_failures" : 0,
        "search_time_in_ms" : 12581507,
        "search_total" : 3130,
        "search_failures" : 0,
        "processing_time_in_ms" : 21175,
        "processing_total" : 3130,
        "exponential_avg_checkpoint_duration_ms" : 1.0403123727272727E7,
        "exponential_avg_documents_indexed" : 1279350.0,
        "exponential_avg_documents_processed" : 9035233.363636363
      },
      "checkpointing" : {
        "last" : {
          "checkpoint" : 2,
          "timestamp_millis" : 1593611391619,
          "time_upper_bound_millis" : 1593611331619
        },
        "operations_behind" : 1597137
      }
    }
  ]
}

Here you go.

Thanks. Again, I do not see anything wrong.

The stats do not contain any errors (see the _failures counters). The checkpoint is only 2, which means only one checkpoint has been created after the initial one; however, the trigger count is 230, so the transform has checked for updates more than 200 times.

The time_upper_bound_millis corresponds to 07/01/2020 @ 1:48pm (UTC). Is the data you are adding dated before that?

Thanks @Hendrik_Muhs!

It seems like the root cause is time_upper_bound_millis. My data is from before this date.

I'm not sure how this time_upper_bound_millis is being populated when my data does not contain this datetime.

Do you know how this timestamp gets added, and how I can force a re-transform (re-index)?

The concept of a continuous transform is to continually increment and process checkpoints as new source data is ingested. The timestamp used for synchronizing source and dest must follow real time, meaning it must be a recent timestamp. To adjust for ingest delays, e.g. because the timestamp you use runs behind due to processing, you can use the delay parameter (default 60s). That causes the transform to subtract the delay when querying for new data, i.e. it only considers documents with a timestamp lt now-delay.
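
In other words, each checkpoint effectively queries a window like the following (a sketch to illustrate the bounds, using the time_upper_bound_millis from your stats as the lower bound; the actual query a transform issues internally is more involved than this):

GET snort*/_search
{
  "query" : {
    "range" : {
      "@timestamp" : {
        "gte" : 1593611331619,
        "lt" : "now-60s"
      }
    }
  }
}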

If you process historic data, there is no need to use a continuous transform, but you can use a batch transform. Is there a reason you want to process historic data but still use continuous mode?

There is a trick: instead of using the historic timestamp, you can add an ingest timestamp while you are feeding in new data, e.g. with an ingest pipeline that uses a set processor. You then use the ingest timestamp for sync; you can keep your timestamp_max and timestamp_min aggregations as they are.
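
A minimal sketch of that approach (the pipeline name and the event.ingested field name are just examples). First, an ingest pipeline that stamps each document with the time it was ingested:

PUT _ingest/pipeline/add_ingest_timestamp
{
  "description" : "Stamp each document with the time it was ingested",
  "processors" : [
    {
      "set" : {
        "field" : "event.ingested",
        "value" : "{{_ingest.timestamp}}"
      }
    }
  ]
}

The sync section of the transform would then point at that field instead of @timestamp:

"sync" : {
  "time" : {
    "field" : "event.ingested",
    "delay" : "60s"
  }
}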

@Hendrik_Muhs

Thank you for clarifying. I am testing out a solution that may be deployed in production, so I wanted to mimic it as closely as possible, but I do not have access to real-time data.

But I think the "trick" of creating a document_created_datetime field will work for both dev and prod.

A big thank you for your help. :+1:
