Daily/hourly aggregation using transform

Hello

I want to have a daily aggregation index with daily avg, percentiles, etc.

I've created a daily transform with the following pivot (the real one is more complex than shown below):

"pivot": {
    "group_by": {
       "datetime": {
        "date_histogram": {
          "field": "datetime",
          "calendar_interval": "1d"
         
        },
       "server": {
         "terms": {
            "field": "server"
          }
       },
      },
  "aggregations": {
      "avg_value": {
        "avg": {
          "field": "value"
        }
      },...........

and an hourly transform with:

"pivot": {
    "group_by": {
       "datetime": {
        "date_histogram": {
          "field": "datetime",
          "calendar_interval": "1h"
         
        },
       "server": {
         "terms": {
            "field": "server"
          }
       },
      },
  "aggregations": {
      "avg_value": {
        "avg": {
          "field": "value"
        }
      },...........

When I start it, it works fine, and I'm happy with the results in the destination index.

The issue is that it causes very high CPU usage, and I see it in the "indexing" state ALL THE TIME.

I tried "frequency": "1h" (the maximum value) for the daily transform, but it's still indexing all the time.

What's the explanation for this?

And is there any solution for running the daily aggregation every day at 03:00 (there are lots of dimensions, so I think Watcher isn't the right tool), and the hourly one at HH:10 every hour?

thanks!

Have you chosen continuous mode for the transform?
It's better with continuous mode.
It will take some time to process the existing data and create the first checkpoint, then it will process newly arriving documents quickly.

Otherwise you can also try rollup instead of transform.

Thanks.
Yes, continuous mode.
It has already calculated the history, but it is still indexing.
What's the advantage of rollup?

With rollup you can define a schedule as you described, but the aggregation possibilities are more limited.
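Just to illustrate, a minimal sketch of a rollup job with cron scheduling (job and destination index names are hypothetical, field names are taken from your snippets above; note that rollup only supports min/max/sum/avg/value_count metrics, so percentiles are not available):

# hypothetical rollup job, scheduled every day at 03:00
PUT _rollup/job/daily
{
  "index_pattern": "data-*",
  "rollup_index": "daily_rollup",
  "cron": "0 0 3 * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "datetime",
      "calendar_interval": "1d"
    },
    "terms": {
      "fields": [ "server" ]
    }
  },
  "metrics": [
    {
      "field": "value",
      "metrics": [ "avg", "min", "max", "value_count" ]
    }
  ]
}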

Transform can only run as fast as search and indexing can work. To understand why your transform runs slowly and does not seem to be able to keep up with the source, you need to analyze it. Without knowing the full config I can only give some hints.

The first thing you can do is look at the output of _transform/{your_transform}/_stats (maybe you can post the output). Among other things, it reports where most of the time is spent (search, indexing or processing). In many cases I have seen, search was the main bottleneck.
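For reference, something like this (the transform id is a placeholder):

# replace {your_transform} with the id of your transform
GET _transform/{your_transform}/_stats

# in the response, compare under "stats":
#   search_time_in_ms vs. index_time_in_ms vs. processing_time_in_ms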

If search runs slow, try to optimize the query by running it outside of transform, e.g. using the profile API. You can find further query optimization guides on the internet that might help. It also might help to apply index sorting on the source index (in the same order as your group_by).
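For illustration, a minimal sketch of such a query run with profiling enabled (index and field names are taken from the snippets above and may need adjusting; a pivot transform runs a composite aggregation similar to this under the hood):

# profile the kind of composite aggregation the transform runs
POST data-*/_search
{
  "size": 0,
  "profile": true,
  "aggs": {
    "pivot": {
      "composite": {
        "size": 500,
        "sources": [
          { "datetime": { "date_histogram": { "field": "datetime", "calendar_interval": "1h" } } },
          { "server": { "terms": { "field": "server" } } }
        ]
      },
      "aggs": {
        "avg_value": { "avg": { "field": "value" } }
      }
    }
  }
}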

If indexing runs slow, you might create the destination index yourself and use more shards, disable _source and/or increase the refresh_interval.
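A minimal sketch of a manually created destination index (index name, shard count, refresh interval and mappings are placeholders, not recommendations):

# hypothetical destination index, created before starting the transform
PUT my-transform-dest
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "refresh_interval": "30s"
    }
  },
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "datetime":  { "type": "date" },
      "server":    { "type": "keyword" },
      "avg_value": { "type": "double" }
    }
  }
}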

A problem with transform might occur if you use a version below 7.7. Do you use the same timestamp field for your date_histogram group_by and for sync? If not, this can lead to bad runtime, too.

As you see, it's hard to answer those sizing/performance questions. In the end you might simply need better or more hardware.

Another tool that might help to analyze performance is Rally; we recently added support for transform.

When it comes to rollup vs. transform vs. other tools: the underlying technology is the same. As said in the beginning, it can only run as fast as search and indexing. If, for example, search is your bottleneck, it doesn't matter whether you use rollup or transform.

Choose rollup or transform based on functional requirements; both have unique features, e.g. rollup provides rollup search, transform has greater flexibility, etc. Or, as you indicated, rollup provides cron-like scheduling; for transform this is still an open enhancement request.

Rollup might indeed be better in terms of performance: rollup only writes complete buckets, while transform always writes as much as it can and updates documents. This causes transform to do more operations than rollup. For example, with a daily group_by, rollup wouldn't touch the source until the day is over, but transform will take as much data as it can and update the buckets throughout the day. On the next day, rollup and transform have produced the same result. Whether you need those intermediate results depends on your use case; to get intermediate results with rollup you might as well use rollup search.

Again, without more information, and at least the output of _stats, it's not possible to answer your performance question. Still, I hope I gave you some hints on where to start investigating the bottleneck.


Thanks a lot, attached:

{
  "count" : 1,
  "transforms" : [
    {
      "id" : "hourly",
      "state" : "indexing",
      "node" : {
        "id" : "-CBycOpbTgS1q7o3CmcXCg",
        "name" : " ",
        "ephemeral_id" : "NuAe9e0mQj6-FPtyPyCCaA",
        "transport_address" : " ",
        "attributes" : { }
      },
      "stats" : {
        "pages_processed" : 833,
        "documents_processed" : 7998803,
        "documents_indexed" : 416500,
        "trigger_count" : 1,
        "index_time_in_ms" : 8290888,
        "index_total" : 833,
        "index_failures" : 0,
        "search_time_in_ms" : 54326124,
        "search_total" : 833,
        "search_failures" : 0,
        "processing_time_in_ms" : 6441,
        "processing_total" : 833,
        "exponential_avg_checkpoint_duration_ms" : 0.0,
        "exponential_avg_documents_indexed" : 0.0,
        "exponential_avg_documents_processed" : 0.0
      },
      "checkpointing" : {
        "last" : {
          "checkpoint" : 0
        },
        "next" : {
          "checkpoint" : 1,
          "position" : {
            "indexer_position" : {
              "clientname" : "XXX",
              "servername" : "YYY",
              "category" : "Databases",
              "platform" : "Microsoft SQL Server",
              "counter" : "Log Flushes/sec",
              "datetime" : 1600232400000,
              "instance" : "ZZZ"
            }
          },
          "checkpoint_progress" : {
            "docs_remaining" : 30341729,
            "total_docs" : 38340532,
            "percent_complete" : 20.862524807949978,
            "docs_indexed" : 416500,
            "docs_processed" : 7998803
          },
          "timestamp_millis" : 1600179143842,
          "time_upper_bound_millis" : 1600175543842
        },
        "operations_behind" : 16105831905
      }
    }
  ]
}

Definition:

PUT _transform/hourly
{
  "source": {
    "index": [
      "data-*"
    ],
    "query": {
      "range": {
        "datetime": {
          "gte": "2020-09-15T00:00:00"
        }
      }
    }
  },
  "pivot": {
    "group_by": {
      "clientname": {
        "terms": {
          "field": "clientname"
        }
      },
      "servername": {
        "terms": {
          "field": "servername"
        }
      },
      "category": {
        "terms": {
          "field": "category"
        }
      },
      "platform": {
        "terms": {
          "field": "platform"
        }
      },
      "counter": {
        "terms": {
          "field": "counter"
        }
      },
      "datetime": {
        "date_histogram": {
          "field": "datetime",
          "calendar_interval": "1h"
        }
      },
      "instance": {
        "terms": {
          "script": {
            "source": "if(doc['instance'].size()>0 && doc['instance'].value !=null){return doc['instance'].value;} return ''; "
          }
        }
      }
    },
    "aggregations": {
      "clientid": {
        "max": {
          "field": "clientid"
        }
      },
      "avg_value": {
        "avg": {
          "field": "value"
        }
      },
      "count": {
        "value_count": {
          "field": "datetime"
        }
      },
      "percentiles_value": {
        "percentiles": {
          "field": "value",
          "percents": [
            4,
            11,
            90,
            97
          ]
        }
      },
      "min_value": {
        "min": {
          "field": "value"
        }
      },
      "max_value": {
        "max": {
          "field": "value"
        }
      }
    }
  },
  "dest": {
    "index": "hourly"
  },
  "settings": {
    "docs_per_second": 250,
    "max_page_search_size": 500
  },
  "description": "Hourly aggregation",
  "frequency": "1h",
  "sync": {
    "time": {
      "field": "datetime",
      "delay": "4h"
    }
  }
}

Thanks. As guessed, search_time_in_ms is much higher than index_time_in_ms and processing_time_in_ms: roughly 15 h vs. 2.3 h vs. 6 s. So search should be optimized first. However, note that index_time_in_ms only covers sending the indexing requests; indexing itself might still be a heavy operation. You can find out how much time is spent indexing via node stats.
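For example (a minimal sketch; look at indices.indexing.index_time_in_millis per node):

# per-node indexing stats
GET _nodes/stats/indices/indexing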

Some things I noticed:

  • sort the group_by from high cardinality to low cardinality, but always pivot first by date_histogram (consider index sorting by datetime on the source index; see the sketch after this list)
  • for the group_by on instance, switch to missing_bucket once 7.10 is available; if you insist on '' instead of null, use an ingest pipeline to set null to '' (a pipeline sketch also follows this list)
  • max_page_search_size seems low given your aggregations (if they are complex, e.g. cardinality or scripted_metric, choose a low value; otherwise you can increase it above the default)
  • the configuration looks like a metrics use case; consider disabling _source on the transform destination index, and you might disable it for the source, too.
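Regarding index sorting, a minimal sketch of what the source index could look like (hypothetical index name; index sorting can only be set at index creation, and the sort fields should match the order of your group_by):

# hypothetical new source index with index sorting
PUT data-sorted-000001
{
  "settings": {
    "index": {
      "sort.field": [ "datetime", "clientname", "servername" ],
      "sort.order": [ "asc", "asc", "asc" ]
    }
  },
  "mappings": {
    "properties": {
      "datetime":   { "type": "date" },
      "clientname": { "type": "keyword" },
      "servername": { "type": "keyword" }
    }
  }
}

And for the null handling on instance, a sketch of an ingest pipeline (hypothetical pipeline name) that sets '' only when the field is missing or null, via the set processor with "override": false:

# hypothetical pipeline; "override": false leaves existing non-null values untouched
PUT _ingest/pipeline/default-instance
{
  "description": "Default instance to '' when missing or null",
  "processors": [
    {
      "set": {
        "field": "instance",
        "value": "",
        "override": false
      }
    }
  ]
}

You could then set index.default_pipeline on the source indices (or reference the pipeline at ingest time) so incoming documents get the default applied.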

Otherwise I do not see anything substantially wrong with this config.


Thanks a lot

I made the suggested changes (except the 7.10 one, which is still not released, of course; and you are right, that's the reason I used a script: the value is null in some documents).

It's running (almost) well, but now I get a new error when it reaches the live documents (after finishing the historical sync):

2020-09-16 12:04:03
instance-0000000074
task encountered irrecoverable failure: field name cannot be null.
2020-09-16 12:18:05
instance-0000000074
Transform has stopped.

Argh, this is the known bug about the incompatibility of continuous transform and scripted group_by. The fix is already scheduled for 7.10, which will also add support for missing_bucket.

I am afraid you can't use a scripted group_by. Are you able to fix your ingest? Setting the field to a default during source ingest is the only way to avoid dropping missing values right now. Because handling missing values is complex and expensive, it's actually highly recommended to fix the ingest anyway, especially if performance is critical.

I am sorry, I do not have a better answer right now, but I will investigate other options and let you know.

After revisiting this problem: the bugfix scheduled for 7.10 got backported to 7.9.2, which will reduce the wait time. However, I cannot give you any information about the release date of 7.9.2 (or 7.10).

Thanks a lot for the helpful information.
I removed the nulls before ingestion and we're good now.
