If you look at the architecture of an Elasticsearch/Lucene index, you will notice it is an inverted index. Inverted means that it is not the documents that are indexed, but the terms they contain. The terms live in a dictionary, and the positions of each term are recorded in a posting list, giving roughly constant lookup time per term once these structures are loaded into main memory. So if you have few terms and millions or billions of documents, the search does not slow down in its runtime complexity.
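To make the idea concrete, here is a minimal, simplified Python sketch of an inverted index mapping terms to posting lists (Lucene's real structures, an FST-based term dictionary with compressed postings, are far more sophisticated, but the lookup principle is the same):

```python
from collections import defaultdict

# Toy corpus: document id -> text
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}

# Build the inverted index: term -> posting list of (doc_id, position)
index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term].append((doc_id, pos))

# Lookup is a single dictionary access per term, independent of how many
# documents exist; only the length of the matching posting list matters.
def search(term):
    return index.get(term, [])

print(search("quick"))  # [(1, 1), (3, 1)]
print(search("dog"))    # [(2, 2), (3, 2)]
```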
Aggregations add a substantial amount of time, depending on the type of aggregation. Typically they do not iterate over all matching documents - that would be too slow; a cardinality aggregation, for example, only estimates the number of distinct values - but it is hard to comment without seeing your aggregation. This is a completely different algorithm from the query itself.
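As an illustration, the cardinality aggregation returns an approximate distinct-value count (Elasticsearch uses a HyperLogLog++ sketch internally) rather than scanning every document. A small sketch against a local node over plain HTTP; the index name `myindex` and field `user_id` are made-up placeholders:

```python
import requests

# Ask Elasticsearch for an approximate count of distinct user_id values.
# "size": 0 suppresses the hits; only the aggregation result is returned.
body = {
    "size": 0,
    "aggs": {
        "unique_users": {
            "cardinality": {"field": "user_id"}
        }
    }
}

resp = requests.post("http://localhost:9200/myindex/_search", json=body)
print(resp.json()["aggregations"]["unique_users"]["value"])
```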
Another factor is the CPU power of the node. You have chosen a single node with 5 shards, maybe with 10 or 20 segments each. For most CPUs this is no challenge, since the number of shards that can be searched in parallel correlates with the number of CPU cores. For concurrent execution, the JVM spawns threads, and those threads must be available on the node, or search subtasks will get queued. Because the work is divided over the available threads, the search time of a node is in practice bound by the CPU's multitasking power. You can test this by creating, say, 1000 shards on a node for a single index: besides using more memory, search time will get longer, because the number of concurrent query executions will exceed the maximum size of the ES search thread pool, which is sized according to the available CPU core count.
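You can observe this queueing directly: the `_cat/thread_pool` API reports the active, queued, and rejected task counts of the search thread pool per node. A minimal sketch against a local node:

```python
import requests

# Show the search thread pool per node: active threads, queued tasks,
# and rejections (tasks dropped once the queue is full).
resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/search",
    params={"v": "true", "h": "node_name,name,active,queue,rejected"},
)
print(resp.text)
```

If the `queue` column grows during your test with an oversized shard count, the node has more concurrent search subtasks than threads, which is exactly the slowdown described above.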