Joining time-series data to calculate differences


(Robert Krombholz) #1

Hi All,

My documents that I index in ES look conceptually similar to this:

{
	"city" : "blah blah",
	"country" : "blah blah",
	"temperature" : "12.0",
	"measureDate": "2015-11-29T11:00:00.000Z"
	...
}

Note: There can be multiple documents(measures) per city per day.
Got one new Index per day with an overall amount of ~250.000.000 documents per day.

Now the question that I try to answer is as follows:

For a specific country (terms filter), please give me all cities with the highest temperature difference from one day to another (using the max temperature of the specific city) sorted by the highest difference.

Does anyone have a good idea on how to do such joins across two time slices?

So far I do two terms aggregations on the city (one on each of the queries indices for the two days), spit out the max temperature for each bucket and then calculate the difference on the client side.
Then I sort the result on the client side, generate the document IDs (they are derived by fields I have as a result of the terms aggregation) and query the top 10 documents.

The problem with this is that the initial two terms aggregation take very long (~20 seconds for a large country with a lot of cities).
This is most likely because I can not limit those two terms aggretations. I need all the results (size: 0) in order to properly do the joins on the client side.

Is Elasticsearch maybe not the right choice for this kind of Problem?
Or could I use a scripted aggregation (to let ES calculate the difference) to my advantage?

A derivative pipeline aggregation in theory does exactly what I need but again I would need to run this across ALL city terms (and there may be a lot) because I can not order the buckets by the 'difference' to the other day.
This is not going to be any better in terms of performance.

Hope this is somewhat understandable. Any input or suggestion would be very helpful.

Cheers
Robert


(Mark Walkom) #2

I can't think of a better way to do it, doesn't mean there isn't one though :stuck_out_tongue:

But if it's slow you may just need to put more resources towards it?


(Robert Krombholz) #3

I'm not really sure how to find out what kind of resource bottleneck this my be caused from.
From the operating system everything looks okay.
Have 2 shards with 1 replica each distributed across 4 machines.
Each ES instance got a 16GB heap and plenty of ram left for IO buffering.
Can't see and spikes in disk IO or CPU usage while those terms aggegrations run.

I implemented routing based on the 'country' field (because thats what the majority of my queries are filtering for).

I am wondering if routing in this scenario is good or bad for performance.
The large term queries will end up running completely on one shard because of my routing.
This is good on one side because the results dont need to be merged but isn't this also limiting the disk IO throughput that I can achive to ths throughput of this single machine?


(Christian Dahlqvist) #4

As you are looking to efficiently identify the max temperature per city, why not rout on a city instead as this probably will distribute processing and data more evenly and use all nodes? This would should also allow more work to be done at the shard level and less data needing to be transferred between nodes.


(Robert Krombholz) #5

Thanks for the hint. Will try that. Even after reading all the documentation it's not that obvious what routing strategy is the most efficient.
Was doing this based on country to actually avoid work distribution (to avoid coordination and sorting overheads).


(system) #6