I use transforms to create an aggregated index. The setup is:
1 live index with live data ingestion. This index currently receives between 5 and 10 million documents per hour, and this will grow in the future.
1 aggregated index whose data comes from 12 transforms in continuous mode. The transforms group by a date histogram with a 1-hour interval (to build an hourly index) plus 5 to 10 other group_by fields to split the data for querying. Each transform calculates 1 to 3 values (count or sum).
The cluster is made of 3 hosts with 12 CPUs, 32 GB of RAM, and 2 TB of disk per node.
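For illustration, a simplified sketch of one of these transforms (the transform, index, and field names here are placeholders, not the real config):

```
PUT _transform/hourly-agg-sketch
{
  "source": { "index": "live-index" },
  "dest": { "index": "aggregated-index" },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "frequency": "30s",
  "pivot": {
    "group_by": {
      "hour": {
        "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
      },
      "device": { "terms": { "field": "device" } }
    },
    "aggregations": {
      "views": { "value_count": { "field": "action" } }
    }
  }
}
```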
My problem is that some transforms accumulate delay behind the live index, and that's a big problem. I have tried to:
set "max_page_search_size" to 10000
set frequency to 30s
But no real change... Is there a trick or a tweak to speed up transforms?
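For reference, those two settings were applied roughly like this (the transform name is a placeholder, and the transform may need to be stopped before updating):

```
POST _transform/hourly-agg-sketch/_update
{
  "frequency": "30s",
  "settings": { "max_page_search_size": 10000 }
}
```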
From the other post I assume you are already on 7.8?
The first thing to start with is looking at _transform/<transform_id>/_stats: what is the biggest bottleneck?
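For example:

```
GET _transform/my-transform/_stats
```

In the response, compare search_time_in_ms, processing_time_in_ms, and index_time_in_ms to see where the time is spent.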
What are these 12 transforms? Do you have 12 source indexes? Do you partition by query? Why 12?
(In one case a user partitioned data with a script query, and it turned out that the script query cost more performance than the parallel execution gained. Additionally, composite aggregations are optimized for match_all.)
Usually performance problems originate from search (see _stats); if so, you can play with the profile API to improve search performance.
The order of the group_by fields might make a difference: date_histogram first, then high cardinality to low cardinality. Consider index sorting and use the same order in group_by.
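A sketch of index sorting on the source index (field names are placeholders; the sort must be defined at index creation time):

```
PUT live-index-sorted
{
  "settings": {
    "index.sort.field": ["@timestamp", "device"],
    "index.sort.order": ["asc", "asc"]
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "device": { "type": "keyword" }
    }
  }
}
```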
Just some pointers to start, it can get complicated.
To give some context: in our project we collect multiple "types" of data, and all of them are stored in one index. For example, we track user "session" or "user page view" events. The document design has:
1 field "type": the type of tracking
1 field "action": the action linked to the tracking
1 field "value": an additional split of the tracking
In this example, I have 2 transforms:
1 with query "type=site" and "action=session" to count sessions
1 with query "type=site" and "action=pages" to count page views
And on this tracking I have some other fields like "device", "browser", etc., added as group_by fields in the transform pivot.
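So the source section of the first transform looks roughly like this (the index name is a placeholder):

```
"source": {
  "index": "live-index",
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "site" } },
        { "term": { "action": "session" } }
      ]
    }
  }
}
```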
Stats look like this on a transform with some delay:
I don't really understand what's happening; it's clearer now, but strange.
So I deleted my old transforms and deleted and recreated the destination index. I created new transforms, adding a group_by on "timestamp" with a 1-hour interval as the first group_by.
When I start all the transforms, the first step works perfectly and fast: the old documents (500 million) are processed in a few minutes.
But when the transforms enter "continuous" mode after the first checkpoint, they become slow and accumulate more and more delay...
When I check the stats of a delayed transform I can see strange behaviour. As the documentation says, a transform uses the latest checkpoint date to search for new documents after that date. But in the stats I can see an "indexer_position" behind the checkpoint date.
I have a similar issue with my transform job. I had been digging into the issue for some time when I came across your post. It helped me clear up a few things, thank you for that. I will be tracking this post for any resolutions, and I will let you know if I find anything.
And with this modification the transforms are fast on every iteration, but it is really tricky, because if a transform falls behind (during a temporary cluster burst) I can lose data.
But this test confirms that by default a transform does not really consider the checkpoint date in its query in continuous mode, and this is not good when trying to do time-based aggregation on a large cluster.
I think a good setting for transforms could be:
the possibility to use the "checkpoint date" as the reference for the search query, for example:
=> date range gte checkpoint_date (if it exists) - (time setting) (e.g. a query with range gte checkpoint date - 10 minutes)
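As a sketch, the query generated at each checkpoint could then contain a filter like this, where the checkpoint date is substituted automatically and the 10-minute margin is a setting (the date shown is only an example):

```
"query": {
  "range": {
    "@timestamp": {
      "gte": "2020-07-01T10:00:00||-10m"
    }
  }
}
```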
But this is only the first part of the problem. I can see that one big transform has delay; when I get its stats, the indexer_position is before the time limit defined in the source index query. I think this is the same problem: the indexer does not consider a date range, and this is not efficient for time-based aggregation: the larger the destination index, the slower the transform.
This would be more efficient (a more aggressive time range could be set) and safer (if the time is based on the checkpoint and not on the current date, we avoid possible data loss if transforms are throttled).
Thanks for the detailed investigation. Transform uses checkpoint information to narrow the search request and minimize the update. However, it's sometimes difficult, because transform is generic and must work with all supported aggregations and in all kinds of situations.
For date_histogram there are 2 reasons why the checkpoint time is not taken into account:
you are on < 7.7 (but afaik you use 7.8), this issue explains it
you use a different timestamp for sync and for date_histogram; this is an open issue and might be the case for you (timestamp vs processed_at)?
If you want to verify what transform does, you can adjust the transform logger to log the queries:
PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.transform.transforms": "trace"
  }
}
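When you are done, you can reset the logger to its default by setting it back to null:

```
PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.transform.transforms": null
  }
}
```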
1 field "processed_at": the timestamp added by Logstash with the date of log processing; this is used for the transform sync (useful when re-indexing old documents)
1 field "@timestamp", used for date queries; this is the date of log creation (the reference date for data analysis).
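So the transform effectively mixes the two fields, which seems to match the open issue above (a sketch of the relevant parts of the config):

```
"sync": {
  "time": { "field": "processed_at", "delay": "60s" }
},
"pivot": {
  "group_by": {
    "hour": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
    }
  }
}
```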
I think you are right, the use of 2 different date fields is the cause of this problem. In the trace log I can see this one:
Subjectively, I would call it a missing feature or known limitation.
Anyway, I have noted this as something to improve and have it on my list, but feel free to open a GitHub issue in addition. It's good to get feedback and prioritize based on it.