Hi, does a transform update documents? I have a continuous transform running on an index which continuously has new documents ingested and existing documents updated. Each document has a date field called "doc.lastUpdated" that specifies when the document was last modified.
When setting up a transform I have to provide a date field ("Select the date field that can be used to identify new documents"). Should this field be the date a document was created, "doc.timeStamp", or the date a document was last updated, "doc.lastUpdated"?
I saw this in the documentation, but I wasn't clear on this setting:
"Changed entities will only be identified if their time field has also been updated and falls within the range of the action to check for changes. This has been designed in principle for, and is suited to, the use case where new data is given a timestamp for the time of ingest."
Yes, a transform updates documents or, to be technically correct, it recalculates and overwrites them.
Continuous transforms synchronize based on a time field. This time field must reflect real time; it cannot be some arbitrary date field. Such a timestamp could be an ingest timestamp, but it could also be a timestamp that is created externally. As data processing takes time, it's important to configure the delay parameter to compensate for such ingest delays.
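As a rough sketch of where these settings live in a transform configuration: the index names (`orders`, `orders-by-customer`), the `customer_id` and `order_id` fields, the transform id, and the `delay` and `frequency` values below are all made up for illustration. It also assumes "doc.lastUpdated" is set to the real wall-clock time whenever a document is written.

```console
PUT /_transform/orders-by-customer
{
  "source": { "index": "orders" },
  "dest": { "index": "orders-by-customer" },
  "frequency": "1m",
  "sync": {
    "time": {
      "field": "doc.lastUpdated",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "customer": { "terms": { "field": "customer_id" } }
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" } },
      "last_update": { "max": { "field": "doc.lastUpdated" } }
    }
  }
}
```

The delay should be at least as long as the time it typically takes for a document to become searchable after its timestamp is assigned; otherwise late-arriving documents can fall outside the window that gets checked.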
The update works as follows: say the transform has processed everything up to timestamp A. New data comes in, and at timestamp B the transform checks which entities have changed between A and B. With this information it recalculates only the changed buckets. Recalculation, however, requires access to all data (up to timestamp B).
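To make the change-detection step more concrete, here is an approximation of the kind of search involved. It is not the literal query transform generates (that is internal and varies by version), and the index name, the `customer_id` field, and the two timestamps standing in for A and B are hypothetical.

```console
# Which customers have documents whose time field falls between A and B?
GET /orders/_search
{
  "size": 0,
  "query": {
    "range": {
      "doc.lastUpdated": {
        "gte": "2020-03-01T10:00:00Z",
        "lt": "2020-03-01T10:01:00Z"
      }
    }
  },
  "aggs": {
    "changed_customers": {
      "composite": {
        "size": 500,
        "sources": [
          { "customer": { "terms": { "field": "customer_id" } } }
        ]
      }
    }
  }
}
```

A document that was modified without its time field being moved into the A to B window would not show up in such a search, which is why the time field has to track real time.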
We are working on various improvements. For example, the minimization of updates is only available for grouping by terms at the moment; with 7.7 we ship an optimization for date histograms. Longer term, we want to add real updates in order to avoid access to historic data, which might have been deleted in the meantime.
That's correct: frequency controls how often the transform triggers a query to check for updates in the source.
What I meant by "minimization of updates" are query filters that narrow the search space. E.g. if you group by terms, it uses a terms query. Date histograms will be optimized in >= 7.7 using a range query. Such optimizations have a substantial impact on runtime performance.
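For illustration only, a recalculation narrowed by a terms filter could look roughly like the search below. The index, the field names, the customer ids, and the timestamp are hypothetical, and the queries transform actually issues may differ.

```console
# Recalculate only the buckets of the customers that changed,
# but over all of their data up to the current checkpoint.
GET /orders/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "customer_id": ["c-17", "c-42"] } },
        { "range": { "doc.lastUpdated": { "lt": "2020-03-01T10:01:00Z" } } }
      ]
    }
  },
  "aggs": {
    "by_customer": {
      "composite": {
        "sources": [
          { "customer": { "terms": { "field": "customer_id" } } }
        ]
      },
      "aggs": {
        "order_count": { "value_count": { "field": "order_id" } }
      }
    }
  }
}
```

Without the terms filter, every bucket in the source index would have to be recomputed at each checkpoint, which is where the runtime difference comes from.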
If you are curious about the internals, transform will tell you the queries it runs once you enable trace logging:
PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.transform.transforms": "trace"
  }
}
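Trace logging is verbose, so you will probably want to turn it off again once you have seen the queries. Setting the same logger to null restores the default level:

```console
PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.transform.transforms": null
  }
}
```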
As of now I can only group by a customer's ID, phone number, IP, etc. The problem I face is that one customer can have many orders, and when grouping by customer the scripted aggregations for feature variable creation can get pretty complex.
For example, if I need to use inference to alert on specific orders and the customer has 6 orders in total, it's a problem, as I'm alerting on the customer grouping and not on individual orders. I guess I could remove the grouping altogether, but then I lose the context of a customer...