Poor I/O on heavy aggregation

I am not sure if Elasticsearch is the cause, but when doing a big aggregation (100 million docs, a unique count on high-cardinality fields, using doc values), the query:

  • takes 1 minute
  • consumes 50% CPU
  • RAM is stable, heap size is stable (no spike at all)
  • I/O maxes out at 30 MB/s

I am wondering if something could be limiting performance (we have SSDs that should deliver 500 MB/s, so I am quite surprised).
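For reference, the aggregation is roughly of this shape (a sketch only; the index name, field name, and host are made up for illustration), i.e. a cardinality aggregation over a high-cardinality field backed by doc values:

```
curl -s -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/events-*/_search' -d '
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": { "field": "user_id" }
    }
  }
}'
```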

Thank you,

How many shards are you aggregating across? How many nodes do you have in the cluster?

Queries and aggregations are performed in a single thread per shard, but shards can be processed in parallel. If you are therefore aggregating across a low number of shards per node and issue a single query, you can be CPU-limited even though CPU is not maxed out on the node.
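One way to check whether this is what is happening (a general suggestion, not something specific to your setup; the host is assumed to be localhost:9200) is to watch the search thread pool and hot threads while the aggregation runs:

```
# Per-node thread pool stats; look at the search pool's active/queue/rejected counts
curl 'http://localhost:9200/_cat/thread_pool?v'

# Stack traces of the busiest threads, to confirm time is spent in the aggregation
curl 'http://localhost:9200/_nodes/hot_threads'
```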

Ah, thank you, I didn't know that (1 thread per shard).

I have a cluster of 3 data nodes (and 2 load-balancing nodes), with 7 indices of 3 shards each = 21 shards total.

What metrics should I look at?

Thank you,

Is that 21 primary shards? How many CPU cores does each node have?

I would also recommend looking at iostat to see what the disk utilisation and await look like. Read throughput on its own may not be the best measure, as the load typically consists of a lot of random-access reads.
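For example, something like the following (standard sysstat iostat, nothing Elasticsearch-specific) shows per-device utilisation and average wait per request:

```
# Extended per-device stats in MB/s, refreshed every 5 seconds;
# watch the %util and await (r_await/w_await on newer sysstat) columns for the data disks
iostat -xm 5
```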

Yes, 21 primary shards, with 1 replica = 42 shards total.

We have SSDs, so I am surprised to see read I/O max out at 30 MB/s:

Zabbix screenshot => http://b3.ms/Oq5nRGK5M0wB

As we use VMware, I suspect a misconfiguration by our sysadmin; could that be possible?

Thank you,

You may want to verify that the VMs have access to the resources they have been assigned and are not over-provisioned with respect to CPU or RAM.
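As a quick, rough check from inside the guest, you can watch CPU steal time; a consistently non-zero value suggests the hypervisor is taking CPU away from the VM:

```
# The 'st' (steal) column in the cpu section shows CPU time reclaimed by the hypervisor
vmstat 5
```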
