Continuous transforms accumulating delay, can they be tweaked to speed up?

Hi guys,

I use transforms to create an aggregated index. The setup is:

  • 1 live index with live data ingestion. Currently this index receives between 5 and 10 million documents per hour, and this will grow in the future.
  • 1 aggregated index, whose data comes from 12 transforms in continuous mode. The transforms are set up to group by a date histogram with a 1-hour interval (to build an hourly index) plus 5 to 10 other group_bys to split the data for querying. Every transform calculates 1 to 3 values (count or sum).

The cluster is made of 3 hosts with 12 CPUs, 32 GB of RAM and 2 TB of disk per node.

My problem is that some transforms accumulate delay behind the live index, and it's a big problem. I have tried to:

  • set "max_page_search_size" to 10 000
  • set "frequency" to 30s

But no real change... Is there a trick or a tweak to speed up transforms?
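
To illustrate, here is a simplified sketch of what one of these transforms looks like (the index, field and aggregation names here are placeholders, not my real configuration):

PUT _transform/hourly-example
{
  "source": {
    "index": ["live-index-*"],
    "query": { "term": { "type": "site" } }
  },
  "dest": { "index": "aggregated-hourly" },
  "frequency": "30s",
  "sync": {
    "time": { "field": "processed_at", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": { "field": "@timestamp", "calendar_interval": "1h" }
      },
      "device": { "terms": { "field": "device" } },
      "browser": { "terms": { "field": "browser" } }
    },
    "aggregations": {
      "event_count": { "value_count": { "field": "action" } },
      "value_sum": { "sum": { "field": "value" } }
    }
  },
  "settings": {
    "max_page_search_size": 10000
  }
}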

Best regards,

From the other post I assume you are already on 7.8?

The 1st thing to start with is looking at _transform/id/_stats, what's the biggest bottleneck?
What are these 12 transforms? Do you have 12 source indexes? Do you partition by query? Why 12?
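
For example (replace my-transform with your transform id):

GET _transform/my-transform/_stats

In the response, compare values like search_time_in_ms, processing_time_in_ms and index_time_in_ms to see where most of the time is spent, and look at exponential_avg_checkpoint_duration_ms versus your frequency.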

(In 1 case a user partitioned data with a script query, and it turned out that the script query cost more performance than the parallel execution gained. Additionally, composite aggs are optimized for match_all.)

Usually performance problems originate from search (see _stats), if so you can play with the profile API to improve search performance.
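
For example, you can run a search equivalent to what the transform executes, with profiling enabled; the index, fields and query below are only placeholders:

GET live-index-*/_search
{
  "size": 0,
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "site" } }
      ]
    }
  },
  "aggs": {
    "pivot": {
      "composite": {
        "size": 1000,
        "sources": [
          { "@timestamp": { "date_histogram": { "field": "@timestamp", "calendar_interval": "1h" } } },
          { "device": { "terms": { "field": "device" } } }
        ]
      }
    }
  }
}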

The order of the group_by might make a difference: date_histogram first, then high cardinality to low cardinality. Consider index sorting and use the same order in the group_by.
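
For example, if your source index has a date field plus some keyword fields, index sorting could look like this (it can only be defined at index creation time; the field names are only an illustration):

PUT live-index-000002
{
  "settings": {
    "index.sort.field": ["@timestamp", "device", "browser"],
    "index.sort.order": ["desc", "asc", "asc"]
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "device": { "type": "keyword" },
      "browser": { "type": "keyword" }
    }
  }
}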

Just some pointers to start, it can get complicated.

Yes, I'm on Elastic 7.8.

For context: in our project we collect multiple "types" of data, all stored in 1 index. For example we track user "sessions" or "user page views". In the document design I have:

  • 1 field "type": the type of tracking
  • 1 field "action": the action linked to the tracking
  • 1 field "value": an additional split of the tracking

In this example, I have 2 transforms:

  • 1 with query "type=site" and "action=session" to count sessions
  • 1 with query "type=site" and "action=pages" to count page views

And on this tracking I have some other data like "device", "browser", etc., added as group_bys in the transform "pivot".

The stats look like this on a transform with some delay:

I don't really know how to interpret these values.

No script in my case.

I can, but I don't think the problem is on the search side; the requests are really simple.

Interesting! I didn't know that the order of group_by was so important. For index sorting I have only set sorting on the date field, descending.

I don't really understand what's happening; it's clearer now, but strange.

So I deleted my old transforms, and deleted and recreated the destination index. I created new transforms, adding a group by on "timestamp" with a 1-hour interval first.

When I start all the transforms, the first step works perfectly and fast: the old documents (500 million) are processed in a few minutes.

But when the transforms enter "continuous" mode after the first checkpoint, they become slow and accumulate more and more delay...

When I check the stats of a delayed transform I can see strange behaviour. As the documentation says, transforms use the latest checkpoint date to search for new documents after this date. But in the stats I can see that the "indexer_position" is behind the checkpoint date.

See my screenshots:

It would seem that with each iteration it traverses the whole destination index, and that's why it is slow.

Is it a bug or something wrong in my setup?


@Germain_Pavot

I have a similar issue with my transform job. I was digging through the issue for some time when I came across your post. It helped me clear up a few things, thank you for that. I will be tracking this post for any resolution, and will let you know if I find anything on this.


So, new tests... To verify my previous observation I tried a new strategy:

  • Clear the destination index, delete the transforms
  • Re-create the transforms and start them.

=> The first checkpoint is fast and the old data is processed in a few minutes.

  • When the first checkpoint ended, I modified the query on the transform and just added this:

      "range": {
        "processed_at": {
          "gte": "now-2h/d"
        }
      }

And with this modification the transforms are fast on every iteration, but this is really tricky because if a transform gets delayed (by a temporary cluster burst) I can lose data.

But this test confirms that by default transforms don't really apply the checkpoint date to the query in continuous mode, and this is not good when we try to build time-based aggregations on a large cluster.

I think a good setting on transforms could be:

  • The possibility to use the "checkpoint date" as a reference in the search query, for example:

=> a date range with gte = checkpoint date (if it exists) - (a time setting), to build a query like "range gte checkpoint date - 10 minutes", for example

But this is only the first part of the problem: I can see that one big transform still has delay. When I get its stats I can see that the indexer_position is before the time limit defined in the source index query. I think this is the same problem: the indexer doesn't consider a date range, and this is not efficient for time-based aggregation: the larger the destination index, the slower the transform.

I hope my explanation is clear :slight_smile:


I'm an idiot :slight_smile: I have modified the time range to:

"range": {
  "processed_at": {
    "gte": "now-2h/h"
  }
}

So no code modification is needed, and it's now really fast. The good strategy for time-based aggregation is:

  • First, create batch transforms for the old data

  • At the end, modify the transforms to:

    1. Add the sync parameters to make the transforms continuous
    2. Add a time range to the queries, with a reasonable range of time, to avoid processing all the data on each iteration (see the sketch below)
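
A sketch of this second step for one of my transforms (THE_ID and the field come from my config; this assumes your version supports the _transform/<id>/_update endpoint, otherwise stop, delete and re-create the transform with the new settings):

POST _transform/THE_ID/_stop

POST _transform/THE_ID/_update
{
  "sync": {
    "time": {
      "field": "processed_at",
      "delay": "5m"
    }
  },
  "source": {
    "index": ["source-*"],
    "query": {
      "bool": {
        "filter": [
          { "range": { "processed_at": { "gte": "now-2h/h" } } }
        ]
      }
    }
  }
}

POST _transform/THE_ID/_start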

And I suggest to the Elastic team the possibility to use the "old checkpoint date" in the query, like this:

"range": {
  "processed_at": {
    "gte": "{{checkpoint}}-2h/h"
  }
}

This would be more efficient (you could set a more aggressive time range) and safer (if the time is based on the checkpoint and not on the current date, we avoid the possibility of losing data if transforms are throttled).

Thanks for the detailed investigation. Transform uses checkpoint information to narrow the search request and minimize the update. However, it's sometimes difficult, because transform is generic and must work with all supported aggregations and in all kinds of situations.

For date_histogram there are 2 reasons why the checkpoint time is not taken into account:

  • you are on < 7.7 (but afaik you use 7.8), this issue explains it
  • you use a different timestamp for sync and for date_histogram; this is an open issue and might be the case for you (timestamp vs processed_at)?
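
To illustrate, "same timestamp" means the sync field and the date_histogram group_by use the same field, roughly like this (just a sketch; whether this fits depends on your data, e.g. it only helps if documents never arrive later than the sync delay):

  "sync": {
    "time": { "field": "@timestamp", "delay": "5m" }
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": { "field": "@timestamp", "calendar_interval": "1h" }
      }
    }
  }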

If you want to verify what transform does, you can adjust the transform logger to log the queries:

PUT /_cluster/settings
{
   "transient": {
      "logger.org.elasticsearch.xpack.transform.transforms": "trace"
   }
}

This should help to debug the problem.
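
Afterwards you can switch the logger back to its default by setting it to null:

PUT /_cluster/settings
{
   "transient": {
      "logger.org.elasticsearch.xpack.transform.transforms": null
   }
}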

Hi Hendrick,

Yeah, you are right, I have 2 "date" fields:

  • 1 field "processed_at": the timestamp added by Logstash with the date of log processing; this is used for the transform sync (useful for processing re-indexed old documents)
  • 1 field "@timestamp": used for date queries; this is the date of the log creation (the reference date for data analysis).

I think you are right, the use of 2 different date fields is the cause of this problem. I can see this line in the trace log:

range\":{\"processed_at\":{\"from\":null,\"to\":1594027879901

Shouldn't "from" be the old checkpoint date?

This is part of my transform settings:

{  
  "settings": {
    "max_page_search_size": 10000
  },
  "frequency": "60s",
  "id": "THE_ID",
  "source": {
    "index": [
      "source-*"
    ],
    "query": {
      "bool": {
        "filter": [
          {
             // my filter, with a temporary fixed date range on the "processed_at" field
          }
        ]
      }
    }
  },
  "dest": {
    "index": "destination-index"
  },
  "sync": {
    "time": {
      "field": "processed_at",
      "delay": "5m"
    }
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1h"
        }
      },
      // some other group by
    },
    "aggregations": {
      // some aggregation settings
    }
  }
}

So is it a "bug"?

Great, we found it.

Subjectively, I would say a missing feature or a known limitation. :wink:

Anyway, I noted this as something to improve, I have it on my list, but feel free to open a GitHub issue in addition. It's good to get feedback and prioritize based on it.


Hehe :slight_smile:

No problem, I'll open an issue on GitHub and I hope it will be "fixed" soon.

Thanks a lot

Issue created: https://github.com/elastic/elasticsearch/issues/59061

