Aggregate On Key Within Date / Time Range Multiple Buckets

Chris_Bartos · March 20, 2019, 3:08pm

Hello,

I'm not really sure how to ask this question succinctly, so maybe I can give an example of what I'm trying to do. I have indexed 30 minute timeslots for a room reservation system. Maybe something like this:

[ { room: "room1", date: "2019-04-01", start_time: "00:00:00", taken: false },
{ room: "room1", date: "2019-04-01", start_time: "00:30:00", taken: false },
{ room: "room1", date: "2019-04-01", start_time: "01:00:00", taken: false },
{ room: "room1", date: "2019-04-01", start_time: "01:30:00", taken: false },
{ room: "room1", date: "2019-04-01", start_time: "02:00:00", taken: true },
{ room: "room1", date: "2019-04-01", start_time: "02:30:00", taken: true },
{ room: "room1", date: "2019-04-01", start_time: "03:00:00", taken: false },
{ room: "room1", date: "2019-04-01", start_time: "03:30:00", taken: false },
{ room: "room1", date: "2019-04-01", start_time: "04:00:00", taken: false }, ... ]

You can imagine there are multiple dates, and multiple rooms being indexed but this is a tiny example.

The query I would like to make is something like:

"give me rooms that are available on April 1, 2019 for 2 hours (4 available adjacent timeslots)"

And, I'd get back "room1" in perhaps in buckets:

{ "aggregations": {
"range": {
"buckets": [
{ "from": "00:00:00",
"to": "01:30:00",
doc_count: 4
}
]
}
}

Or, something similar. I don't know if this is possible.

Thanks,
Chris

BenTrent · March 20, 2019, 5:37pm

Hey @Chris_Bartos,

This is a good question :D. I have been thinking about possible solutions. One I came up with is if you can make start_time an integer, then I have something that will get you started.
assuming your docs:

POST _bulk
{ "index" : { "_index" : "room_tmp", "_type" : "_doc" } }
{ "room": "room1", "date": "2019-04-01", "start_time": 0,  "taken": false }
{ "index" : { "_index" : "room_tmp", "_type" : "_doc" } }
{ "room": "room1", "date": "2019-04-01", "start_time": 30, "taken": false }
{ "index" : { "_index" : "room_tmp", "_type" : "_doc" } }
{ "room": "room1", "date": "2019-04-01", "start_time": 60, "taken": false }
{ "index" : { "_index" : "room_tmp", "_type" : "_doc" } }
{ "room": "room1", "date": "2019-04-01", "start_time": 90, "taken": false }
{ "index" : { "_index" : "room_tmp", "_type" : "_doc" } }
{ "room": "room1", "date": "2019-04-01", "start_time": 120, "taken": true }
{ "index" : { "_index" : "room_tmp", "_type" : "_doc" } }
{ "room": "room1", "date": "2019-04-01", "start_time": 150, "taken": true }
{ "index" : { "_index" : "room_tmp", "_type" : "_doc" } }
{ "room": "room1", "date": "2019-04-01", "start_time": 180, "taken": false }
{ "index" : { "_index" : "room_tmp", "_type" : "_doc" } }
{ "room": "room1", "date": "2019-04-01", "start_time": 210, "taken": false }
{ "index" : { "_index" : "room_tmp", "_type" : "_doc" } }
{ "room": "room1", "date": "2019-04-01", "start_time": 240, "taken": false }

Here is are some aggregations to get you going. The idea here is we restrict the query to the given day, do a terms agg on the room, and split those up into chunks that match the time you want (in minutes).

GET room_tmp/_search
{
  "size": 0,
  "query": { // put anything really in the query
    "constant_score": {
      "filter": {"term": {
        "date": "2019-04-01"
      }}
    }
  },
  "aggs": {
    "by-room": {
      "terms": {
        "field": "room.keyword",
        "size": 10 //Whatever number of rooms...
      },
      "aggs": {
        "every-interval": {
          "histogram": {
            "field": "start_time",
            "interval": 120 // The number of minutes you care about
          },
          "aggs": {
            "num-taken": {
              "sum": {
                "field": "taken"
              }
            },
            "is-available": {
              "bucket_selector": {
                "buckets_path": {
                  "taken": "num-taken"
                },
                "script": "params.taken == 0" // all the `taken` entries are false
              }
            }
          }
        }
      }
    }
  }
}

Something even better would be to make start_time a date field type.

PUT room_tmp
{
  "mappings" : {
      "_doc" : {
        "properties" : {
          "date" : {
            "type" : "date"
          },
          "room" : {
            "type" : "keyword"
          },
          "start_time" : {
            "type" : "date",
            "format": "YYYY-MM-DD HH:mm:ss"
          },
          "taken" : {
            "type" : "boolean"
          }
        }
      }
    }
}

Example doc:

{ "room": "room1", "date": "2019-04-01", "start_time": "2019-04-01 03:30:00", "taken": false }

Then the agg could be:

GET room_tmp/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {"term": {
        "date": "2019-04-01"
      }}
    }
  },
  "aggs": {
    "by-room": {
      "terms": {
        "field": "room",
        "size": 10
      },
      "aggs": {
        "every-interval": {
          "date_histogram": {
            "field": "start_time",
            "interval": "120m"
          },
          "aggs": {
            "num-taken": {
              "sum": {
                "field": "taken"
              }
            },
            "is-available": {
              "bucket_selector": {
                "buckets_path": {
                  "taken": "num-taken"
                },
                "script": "params.taken == 0"
              }
            }
          }
        }
      }
    }
  }
}

Chris_Bartos · March 20, 2019, 5:56pm

That looks very helpful!

Thank you very much,
Chris

Chris_Bartos · March 21, 2019, 2:12pm

It doesn't seem to work perfectly. It works good, but not ideal.

Right now, it checks each bucket like so:

Check 1,2,3,4 for all "taken": false
Check 5,6,7,8 for all "taken": false

This isn't what is ideal.

Instead, is there a way to do the check:

Check 1,2,3,4 for all "taken": false
Check 2,3,4,5 for all "taken": false
Check 3,4,5,6 for all "taken": false
etc....

Does that make sense? Is that possible?

BenTrent · March 22, 2019, 1:24am

@Chris_Bartos yeah, makes total sense, I am on the trail of something. It seems like you want a "rolling sum" of the taken field. Basically, "is the current slot of 30 minutes taken, and are the previous X free"?

What I do in the below aggregations is:

bucket into 30 minutes (the individual units you care about)
do summation of the taken field (which will usually just be a single value 1 or 0). Indicates if that current bucket is taken
Then do a moving_fn aggregation that is a rolling sum with a window of 2 (how many buckets in the past do you want to look? 90m slot would be current window 30m + 2 previous 30m windows).
I then do a bucket_selector to look at the "current" bucket and the previous two before it, if the sum of both numbers is 0, then we can assume that we have no taken: true in that bucket, or the previous 2.

I , I do summation of the taken field (which will usually just be a single value 1 or 0)

GET room_tmp/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {"term": {
        "date": "2019-04-01"
      }}
    }
  },
  "aggs": {
    "by-room": {
      "terms": {
        "field": "room",
        "size": 10
      },
      "aggs": {
        "every-interval": {
          "date_histogram": {
            "field": "start_time",
            "interval": "30m"
          },
          "aggs": {
            "num-taken": {
              "sum": {
                "field": "taken"
              }
            },
            "rolling-sum" : {
              "moving_fn": {
                "buckets_path": "num-taken",
                "window": 2, // The scheduling window size - 1, as we look at the "current" bucket too
                "script": "MovingFunctions.sum(values)"
              }
            },
            "is-free": {
              "bucket_selector": {
                "buckets_path": {
                  "currently_taken": "num-taken",
                  "previously_taken": "rolling-sum"
                },
                "script": "(params.currently_taken + params.previously_taken) == 0"
              }
            }
          }
        }
      }
    }
  }
}

Definitely a tricky problem

Chris_Bartos · March 22, 2019, 2:57pm

I had something similar to that, but I was confused why moving_fn was working differently than what I would expect it to work. (Bug in ElasticSearch (Aggs and moving_fn) v6.6.2)

It looks like it works but the output is alittle confusing:

If I do "window: 1" (for 1 hour timeslot (2 30m slots)):

If the first 2 slots are open, the query will return the first two timeslots.
For all the other slots, the buckets will return the last date (so, it looks like if you subtract 1 hour in this example, it indeed will be available for the entire window).

So, it definitely works, I'm just confused what the window is actually doing and why it makes you have to get a sum of the last couple timeslots. I was hoping it would be as simple as:

get a date histogram of all the timeslots each with the sum of field: taken (1 for taken, 0 for not taken)
move a window for the number of timeslots you want (2 for 1 hour, 3 for 1.5 hours, etc.) and sum all the timeslots for the field: taken.
create buckets where sum == 0.

Anyway, thanks a lot for your help! Just glad it works

BenTrent · March 22, 2019, 3:03pm

The moving_fn only looks at previous buckets. It does not take its "current bucket" into account. So, if you do the whole window you want (instead of subtracting one for the current), you will miss the scenario for when the "current bucket" is not available.

As for why it is implemented this way, I am not 100% sure, I am sure there is a good technical reason for it. The main author Zach Tong is a pretty smart dude.

system · April 19, 2019, 3:03pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Aggregation by date Elasticsearch	2	443	July 5, 2017
ES Aggregation hit that can fit multiple buckets should be in all buckets Elasticsearch	2	547	November 15, 2017
Time-range search query help Elasticsearch	2	358	August 27, 2020
Aggregate two collections on date field Elasticsearch	1	363	September 13, 2018
Date Histogram - group by hour interval across multiple days Elasticsearch	4	6621	April 21, 2017

Aggregate On Key Within Date / Time Range Multiple Buckets

Related topics