Slow transform performance

I'm currently running a simple transform across ~500M docs, but it is going very slowly, seemingly searching at <1,000 docs/s, which is slower than we ingest them live. The nodes are barely using any CPU or network, and I have the throttle set to 10,000 docs/s, so I'm not sure why this is. Is there any way I can investigate why it's so slow, or attempt to speed it up?

With so little information it's hard to answer. Can you post:

  • your configuration
  • Elasticsearch version
  • some description of the data, maybe an example document
  • the output of _transform/your_job/_stats

Transform can only go as fast as aggregation and bulk indexing can do their work.

Of course. I'm running ES 7.8 in a cluster of 12 nodes (3 master, 6 hot, and 3 cold). All the relevant data is on the hot nodes, which have 12 vCPUs each and are backed by SSD storage.

The data is simple logs of player activity, and the transform simply looks for the first and last log (@timestamp.min and @timestamp.max) and associates them with a userId, so we can see when a user first joined and was last active. The transform appears to be running on a cold node, which only has 2 vCPUs and is HDD backed, but that doesn't appear to be limiting it in any significant way (currently 1% CPU usage).
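For reference, the transform definition is roughly along these lines (the index names and the exact user-id field name are simplified here, so treat them as placeholders):

// index names and the userId field below are placeholders
PUT _transform/user-total-play-timeframe
{
  "source": {
    "index": "player-activity-*"
  },
  "dest": {
    "index": "user-total-play-timeframe"
  },
  "sync": {
    "time": {
      "field": "@timestamp"
    }
  },
  "pivot": {
    "group_by": {
      "user_id": {
        "terms": { "field": "userId" }
      }
    },
    "aggregations": {
      "@timestamp.min": {
        "min": { "field": "@timestamp" }
      },
      "@timestamp.max": {
        "max": { "field": "@timestamp" }
      }
    }
  }
}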

Here are the stats for the transform.

{
  "count" : 1,
  "transforms" : [
    {
      "id" : "user-total-play-timeframe",
      "state" : "indexing",
      "node" : {
        "id" : "VKyCmhWCQw6EeD2Y2a_Irg",
        "name" : "elasticsearch-data-cold-1",
        "ephemeral_id" : "5OKGvo-MTcGwmRGNcAXVUw",
        "transport_address" : "10.4.6.5:9300",
        "attributes" : { }
      },
      "stats" : {
        "pages_processed" : 4783,
        "documents_processed" : 43789906,
        "documents_indexed" : 2391500,
        "trigger_count" : 6,
        "index_time_in_ms" : 309535,
        "index_total" : 4783,
        "index_failures" : 0,
        "search_time_in_ms" : 78265914,
        "search_total" : 4784,
        "search_failures" : 1,
        "processing_time_in_ms" : 26082,
        "processing_total" : 4783,
        "exponential_avg_checkpoint_duration_ms" : 0.0,
        "exponential_avg_documents_indexed" : 0.0,
        "exponential_avg_documents_processed" : 0.0
      },
      "checkpointing" : {
        "last" : {
          "checkpoint" : 0
        },
        "next" : {
          "checkpoint" : 1,
          "position" : {
            "indexer_position" : {
              "user_id" : "1194832109"
            }
          },
          "checkpoint_progress" : {
            "docs_remaining" : 480120322,
            "total_docs" : 523910228,
            "percent_complete" : 8.358284236436019,
            "docs_indexed" : 2391500,
            "docs_processed" : 43789906
          },
          "timestamp_millis" : 1593695873672
        },
        "operations_behind" : 766723173
      }
    }
  ]
}

Hi,

You could try disabling the transform role on your cold nodes so transforms always run on the hot nodes: https://www.elastic.co/guide/en/elasticsearch/reference/master/transform-setup.html#transform-setup-nodes
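In 7.x this is a static node setting, so something like the following in elasticsearch.yml on the cold nodes (plus a restart of those nodes) should do it, assuming the node.transform setting is available in your version:

# elasticsearch.yml on the cold nodes: stop them from hosting transform tasks
node.transform: false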

Best regards
Wolfram

You are at checkpoint 0 and in the process of creating checkpoint 1, which means you are in the bootstrapping phase.

Do you have a lot of historic/old data? If so, the very first checkpoint will take time to process all the historic data; only after checkpoint 1 does the transform go into the continuous phase.

The transform task is not your problem. From the stats, take a look at these three relevant numbers:

"index_time_in_ms" : 309535,
"search_time_in_ms" : 78265914,
"processing_time_in_ms" : 26082,

Processing is negligibly small. This is what the transform task itself is using, so moving the task to another node will not make a difference; this is not your bottleneck.

Most time is spent in search. Search is distributed and happens where the data is stored, so apart from the potential issue of bootstrapping over a lot of historic data, this is what you need to optimize.

As your aggregations are simple, you can increase max_page_search_size, which defines how many buckets are created per search. The default is 500 and the maximum is 10,000. Increasing it will increase memory usage, but for simple aggs this shouldn't matter. You can use the _update API to set it (you don't need to stop/start the transform).
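For example, something like this should work while the transform keeps running (the transform id is taken from your stats, the value is just an illustration):

POST _transform/user-total-play-timeframe/_update
{
  "settings": {
    "max_page_search_size": 10000
  }
}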

It might also be worth looking at hot threads; maybe the thread pools are the limitation.
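For example:

GET _nodes/hot_threads

GET _cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected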

What's the setup of your source data? Shards, index sorting, etc.?
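You can check with something like this (the index pattern is a placeholder):

GET player-activity-*/_settings?filter_path=*.settings.index.number_of_shards,*.settings.index.sort*

GET _cat/shards/player-activity-*?v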


Increasing max_page_search_size seems to have sped it up a lot and it is now transforming at a reasonable speed, thanks!
