Very slow aggregation performance for trivial aggs


(Zaar Hai) #1

Good day,

I have ES 1.5.2, deployed over 15 nodes. Each node has 40 GB RAM, two 2.5 GHz CPU cores and a single SSD drive.

I want to store some "sensor data" for each machine m and its sensors s. In each doc I capture machine_id (m), sensor id and sub_ids (s, s1, s2), timestamp (ts), and 4 values (v1..v4).

I've created 24 indexes with 15 shards each using the following mapping:

{ "sample": {
    "_all": {"enabled": false},
    "properties": {
        "m": {
            "type": "string",
            "index": "not_analyzed",
            "doc_values": true
        },
        "s": {
            "type": "string",
            "index": "not_analyzed",
            "doc_values": true
        },
        "ts": {
            "type": "date",
            "doc_values": true
        },
        "ss1": {
            "type": "string",
            "index": "not_analyzed",
            "doc_values": true
        },
        "ss2": {
            "type": "string",
            "index": "not_analyzed",
            "doc_values": true
        },
        "v1": {
            "type": "long",
            "doc_values": true,
            "index": "no"
        },
        "v2": {
            "type": "long",
            "doc_values": true,
            "index": "no"
        },
        "v3": {
            "type": "long",
            "doc_values": true,
            "index": "no"
        },
        "v4": {
            "type": "long",
            "doc_values": true,
            "index": "no"
        }
    }
}
}

I've indexed 600 million docs into each index (using routing by m field). My indexing speed goes up to 200,000 docs/sec which is very satisfying, but the aggregation performance is very slow.

I've created a single alias for all of these indices. Now consider this simple aggregation:

curl "http://example.com:9200/myalias/_search?search_type=count" -d '{
    "aggs": {
        "1_min_date": {
            "min": {
                "field": "ts"
            }
        },
        "2_max_date": {
            "max": {
                "field": "ts"
            }
        }
    }
}'

I ran it against my idle cluster and it took about 40 seconds to execute. For roughly the first half of that time, every node's CPU was 100% busy; for the rest of the time only a single node was at 100% CPU, probably trying to aggregate the results.

Why is it so slow? The ts field is indexed, so to my understanding, finding its minimal value should be an O(1) operation on each segment of every shard, i.e. the complexity should be O(number_of_shards). I have 360 shards, so that's 360 lookups and then sorting an array of 360 members.

What am I doing wrong?

Thank you.


(Mark Harwood) #2

ts field is indexed ... finding minimal value for it is a matter of O(1) operation on each segment of every shard

What you are describing is theoretically the best speed attainable if you ignore everything else aggregations are designed to do.
Let's look at what is missing from your example:

  1. No query to subset the data (e.g. "find all records with IP address X")
  2. No parent aggregation (e.g. a terms agg to group by "machine" field value)
  3. No filtering (e.g. an alias restricting access to docs via a filter)

Now consider that the max/min etc. "leaf" aggregations are designed to sit at the end of this chain of filters and look up values for each doc produced by this stream. There's no cheating or shortcuts in these implementations that look in the inverted index for the last value in the global terms enum: they always expect to be processing a stream of (potentially filtered) doc IDs, and the value for each doc needs to be retrieved to determine the max. So your cost will be O(n), where n is the number of docs on a shard. Obviously, when you are using doc values, OS-level file system caching is going to be a big help.
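The streaming behaviour described above can be sketched with a toy example: a leaf min/max agg makes one pass over the doc-values column of every matching doc (faked here with printf feeding awk), so the work grows linearly with doc count no matter how the values are ordered on disk.

```shell
# Hypothetical doc-values column for "ts" on one shard: one value per doc.
# The agg streams every value and tracks the running min/max -- O(n), not O(1).
printf '%s\n' 1433000000 1432000000 1434500000 1431750000 |
awk 'NR==1 {min=$1; max=$1}
     {if ($1 < min) min=$1; if ($1 > max) max=$1}
     END {print "min=" min, "max=" max}'
```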


Return document for aggregation result
(Zaar Hai) #3

Mark,
thanks for the detailed answer.

I understand the Elasticsearch way of doing this now. After adding some filtering and time-range bucketing that mimic my real application-to-be more closely, I brought query execution time below 400ms. Execution still looks CPU bound, so I guess I'd get more speed from cores with higher clock rates.
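For reference, a query of the kind described here (a machine filter plus time-range bucketing feeding the min/max aggs) might look like the following sketch in the ES 1.x DSL; the machine id "machine-42" and the hourly interval are made up, while the m and ts fields come from the mapping above:

```shell
curl "http://example.com:9200/myalias/_search?search_type=count" -d '{
    "query": {
        "filtered": {
            "filter": {"term": {"m": "machine-42"}}
        }
    },
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "ts", "interval": "hour"},
            "aggs": {
                "min_date": {"min": {"field": "ts"}},
                "max_date": {"max": {"field": "ts"}}
            }
        }
    }
}'
```

Because the term filter runs first, the leaf min/max aggs only stream the docs that survive the filter, which is where the speedup comes from.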

I'm curious, however: is there any efficient way to find min/max values over large data sets in Elasticsearch?


(Mark Harwood) #4

You can get some of that info out of the field stats API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-stats.html
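As a sketch, a field stats request against the alias from earlier in the thread would look like this; the API reads per-field statistics (including min and max values) that Lucene already maintains at the segment level, so no per-document scan is needed:

```shell
# Returns index-level stats for the ts field, including min_value/max_value.
curl "http://example.com:9200/myalias/_field_stats?fields=ts"
```

Note that these are index-wide statistics: they cannot be combined with a query or filter the way an aggregation can.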


(Zaar Hai) #5

This is for ES 1.6 and on, right?


(system) #6