I have a question which way of writing a query is faster.
Let's say we want to calculate two things, total revenue and total unique users, both of which are separate "event types", in order to either visualise them directly or to visualise average revenue per user.
Is it faster to, in pseudo code terms, just calculate the values directly such as:
totalRevenue = sum of data.purchasePriceInUSD,
uniqueUsers = cardinality of data.userId.hash (murmur3 hash)
with a filter of *.
Or is it faster to filter them first it such as
for totalRevenue filter on data.eventType:'Purchase'
for uniqueUsers filter on data.eventType:'AppLoad'
(Both approaches will yield the same result)
In other words, is it faster to process more events with the sum and cardinality operators (purchasePriceInUSD is only available in 'Purchase' type events, userId.hash is available in all event types) without filtering, or is it faster to first filter to only the relevant events and operate on them only? My intuition says it's faster to filter first but wanted to check here for guidelines. We have several billion events in total.