Retrieve sampled geolocation data

Hello,
I have an index that collects one geospatial data point per second for each device (e.g., a moving car).
Now I need to retrieve those geopoints in order to plot the path that a given vehicle has traveled. However, plotting a 2-hour path at that frequency means 3600 * 2 = 7200 data points, which is definitely too many for this use case (although I do need that resolution for other use cases).
Is there a way to query the index and retrieve only a sample of those 7200 points?

Thanks a lot!
Marco

We have a sampler aggregation which can take the top N hits and feed them to a contained child aggregation.
Additionally, there's a diversified_sampler which may be useful to ensure the selection of docs is not focused on any one particular time range or location.
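For example, something along these lines (a rough, untested sketch; the gps_data index and the deviceId, location, and eventTime fields are placeholders for your own):

GET gps_data/_search
{
  "size": 0,
  "query": {
    "term": {
      "deviceId": 123
    }
  },
  "aggs": {
    "my_sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "sampled_points": {
          "top_hits": {
            "size": 100,
            "_source": ["location", "eventTime"]
          }
        }
      }
    }
  }
}

The shard_size controls how many of the best-matching docs each shard feeds into the child aggregations.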

Wonderful!

Thanks a lot, Mark.

Elastic is a masterpiece!


Hello Mark,

I'm experiencing problems with the diversified_sampler aggregation.
This is the query I'm setting up:

GET gps_data/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "deviceId": 123
          }
        },
        {
          "range": {
            "start": {
              "gte": 1562047171000,
              "lte": 1562047514000,
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "my_unbiased_sample": {
      "diversified_sampler": {
        "shard_size": 200,
        "field": "eventTime"
      },
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "location"
          }
        }
      }
    }
  }
}

Elastic replies with "significant_terms aggregation cannot be applied to field [location]. It can only be applied to numeric or string fields." I'm wondering whether I'm setting up the query incorrectly or whether this is an actual limitation of the aggregation.

Thanks!
Marco

Wasn't the goal to put your geo aggregation under the sampler?
You may also need a coarser-grained field for diversification: if the timestamps have millisecond accuracy, you'll only be limiting the number of docs considered per millisecond. You might need to use a script to round the times to hours or minutes, or whatever unit suits your de-duplication.
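Roughly like this (an untested sketch; dividing the epoch millis by 60000 diversifies per minute, and the geohash_grid precision of 7 is just an illustrative choice):

GET gps_data/_search
{
  "size": 0,
  "query": {
    "term": {
      "deviceId": 123
    }
  },
  "aggs": {
    "my_unbiased_sample": {
      "diversified_sampler": {
        "shard_size": 200,
        "execution_hint": "map",
        "script": {
          "lang": "painless",
          "source": "doc['eventTime'].value.toInstant().toEpochMilli() / 60000"
        }
      },
      "aggs": {
        "path_cells": {
          "geohash_grid": {
            "field": "location",
            "precision": 7
          }
        }
      }
    }
  }
}

With the default max_docs_per_value of 1, this keeps at most one doc per minute in each shard's sample.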

Ok, got your point.

I'll have to refactor the index a bit then; I want to avoid scripted fields to keep performance high enough.

Meanwhile, I'm using the function_score query with a random_score function. The uniform scoring it produces lets me drop enough data while still getting samples of reasonably good quality.
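For reference, this is roughly the query I mean (simplified; size just caps how many randomly scored points come back):

GET gps_data/_search
{
  "size": 200,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "term": {
                "deviceId": 123
              }
            },
            {
              "range": {
                "start": {
                  "gte": 1562047171000,
                  "lte": 1562047514000,
                  "format": "epoch_millis"
                }
              }
            }
          ]
        }
      },
      "random_score": {}
    }
  }
}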

Thanks again for your support.


Sounds like a good approach.
