Slow transform performance

I'm currently running a simple transform across ~500M docs, but it is going very slowly, seemingly searching at <1,000 docs/s, which is slower than we ingest them live. The nodes are barely using any CPU or network, and I have the throttle set to 10,000 docs/s, so I'm not sure why this is. Is there any way I can investigate why it's so slow, or attempt to speed it up?

With so little information it's hard to answer. Can you post:

  • your configuration
  • Elasticsearch version
  • some description of the data, maybe an example document
  • the output of _transform/your_job/_stats

Transform can only go as fast as aggregation and bulk indexing can do their work.

Of course. I'm running ES 7.8 in a cluster of 12 nodes (3 master, 6 hot, and 3 cold). All the relevant data is on the hot nodes, which have 12 vCPUs each and are backed by SSD storage.

The data is simple logs of player activity, and the transform simply looks for the first and last log (@timestamp.min and @timestamp.max) and associates them with a userId, so we can see when a user first joined and was last active. The transform appears to be running on a cold node, which only has 2 vCPUs and is HDD backed, but that doesn't appear to be limiting it in any significant way (currently 1% CPU usage).
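For reference, the transform definition is roughly along these lines (the index names and the exact user-id field name are simplified here, so treat them as placeholders):

// index names and the userId field below are placeholders
PUT _transform/user-total-play-timeframe
{
  "source": {
    "index": "player-activity-*"
  },
  "dest": {
    "index": "user-total-play-timeframe"
  },
  "sync": {
    "time": {
      "field": "@timestamp"
    }
  },
  "pivot": {
    "group_by": {
      "user_id": {
        "terms": { "field": "userId" }
      }
    },
    "aggregations": {
      "@timestamp.min": {
        "min": { "field": "@timestamp" }
      },
      "@timestamp.max": {
        "max": { "field": "@timestamp" }
      }
    }
  }
}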

Here are the stats for the transform.

{
  "count" : 1,
  "transforms" : [
    {
      "id" : "user-total-play-timeframe",
      "state" : "indexing",
      "node" : {
        "id" : "VKyCmhWCQw6EeD2Y2a_Irg",
        "name" : "elasticsearch-data-cold-1",
        "ephemeral_id" : "5OKGvo-MTcGwmRGNcAXVUw",
        "transport_address" : "10.4.6.5:9300",
        "attributes" : { }
      },
      "stats" : {
        "pages_processed" : 4783,
        "documents_processed" : 43789906,
        "documents_indexed" : 2391500,
        "trigger_count" : 6,
        "index_time_in_ms" : 309535,
        "index_total" : 4783,
        "index_failures" : 0,
        "search_time_in_ms" : 78265914,
        "search_total" : 4784,
        "search_failures" : 1,
        "processing_time_in_ms" : 26082,
        "processing_total" : 4783,
        "exponential_avg_checkpoint_duration_ms" : 0.0,
        "exponential_avg_documents_indexed" : 0.0,
        "exponential_avg_documents_processed" : 0.0
      },
      "checkpointing" : {
        "last" : {
          "checkpoint" : 0
        },
        "next" : {
          "checkpoint" : 1,
          "position" : {
            "indexer_position" : {
              "user_id" : "1194832109"
            }
          },
          "checkpoint_progress" : {
            "docs_remaining" : 480120322,
            "total_docs" : 523910228,
            "percent_complete" : 8.358284236436019,
            "docs_indexed" : 2391500,
            "docs_processed" : 43789906
          },
          "timestamp_millis" : 1593695873672
        },
        "operations_behind" : 766723173
      }
    }
  ]
}

Hi,

You could try disabling the transform role on your cold nodes so transforms always run on the hot nodes: https://www.elastic.co/guide/en/elasticsearch/reference/master/transform-setup.html#transform-setup-nodes
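In 7.x this is a static node setting, so something like the following in elasticsearch.yml on the cold nodes (plus a restart of those nodes) should do it, assuming the node.transform setting is available in your version:

# elasticsearch.yml on the cold nodes: stop them from hosting transform tasks
node.transform: false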

Best regards
Wolfram

You are at checkpoint 0 and in the process of creating checkpoint 1, which means you are in the bootstrapping phase.

Do you have a lot of historic/old data? If so, the very first checkpoint will take time to process all the historic data; only after checkpoint 1 does the transform go into the continuous phase.

The transform task is not your problem. From the stats, take a look at these three relevant numbers:

"index_time_in_ms" : 309535,
"search_time_in_ms" : 78265914,
"processing_time_in_ms" : 26082,

Processing is negligibly small. This is what the transform task itself is using, so moving the task to another node will not make a difference; this is not your bottleneck.

Most time is spent in search. Search is distributed and happens where the data is stored, so apart from the potential issue of bootstrapping over a lot of historic data, this is what you need to optimize.

As your aggregations are simple, you can increase max_page_search_size, which defines how many buckets are created per search. The default is 500 and the maximum is 10,000. Increasing it will increase memory usage, but for simple aggs this shouldn't matter. You can use the _update API to set it (you don't need to stop/start the transform).
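For example, something like this should work while the transform keeps running (the transform id is taken from your stats, the value is just an illustration):

POST _transform/user-total-play-timeframe/_update
{
  "settings": {
    "max_page_search_size": 10000
  }
}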

It might also be worth looking at hot threads; maybe the thread pools are the limitation.
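For example:

GET _nodes/hot_threads

GET _cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected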

What's the setup of your source data? Shards, index sorting, etc.?
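You can check with something like this (the index pattern is a placeholder):

GET player-activity-*/_settings?filter_path=*.settings.index.number_of_shards,*.settings.index.sort*

GET _cat/shards/player-activity-*?v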


Increasing max_page_search_size seems to have sped it up a lot and it is now transforming at a reasonable speed, thanks!
