Hello,
I have an index that collects one geospatial data point per second for each device (e.g., a moving car).
Now, I need to retrieve the geopoints in order to plot the path that a given vehicle has made. However, plotting a path 2 hours long at such a frequency ends up with 3600*2 = 7200 data points, which is definitely too much for this use case (though I do need that resolution for other use cases).
Is there a way to query the index and retrieve only a sample of those 7200 points?
We have a sampler aggregation which can take the top N hits and feed them to a contained child aggregation.
Additionally there's a diversified_sampler which may be useful to ensure the selection of docs is not focused in any one particular time range or location.
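A minimal sketch of what that could look like, assuming an index named `vehicle-tracks` with a `geo_point` field called `location`, a `date` field called `timestamp`, and a `keyword` field called `vehicle_id` (all names here are illustrative, not from your mapping):

```json
POST /vehicle-tracks/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "vehicle_id": "car-42" } },
        { "range": { "timestamp": { "gte": "now-2h" } } }
      ]
    }
  },
  "aggs": {
    "sample": {
      "sampler": { "shard_size": 500 },
      "aggs": {
        "path_points": {
          "geohash_grid": { "field": "location", "precision": 7 }
        }
      }
    }
  }
}
```

Here the `sampler` limits each shard to its top 500 matching docs, and the child `geohash_grid` aggregation buckets their locations. Note that with a pure filter query all docs score identically, so "top N" is effectively arbitrary rather than uniformly random.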
Elasticsearch replies that "significant_terms aggregation cannot be applied to field [location]. It can only be applied to numeric or string fields." I'm wondering whether I'm setting up the query incorrectly or whether this limitation actually exists.
Wasn't the goal to put your geo aggregation under the sampler?
You may need a coarser-granularity field for diversification too - if the timestamp accuracy is millisecond level, you'll only be limiting the number of docs considered per millisecond. You might need to use a script to round the times down to hours or minutes or whatever suits as your unit for de-duplicating.
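Roughly, that could look like the following (a sketch under the same assumed field names as above; integer division by 60000 rounds epoch millis down to minute buckets, and the Painless accessor shown is the one for recent Elasticsearch versions):

```json
POST /vehicle-tracks/_search
{
  "size": 0,
  "aggs": {
    "diverse_sample": {
      "diversified_sampler": {
        "shard_size": 200,
        "max_docs_per_value": 1,
        "script": {
          "lang": "painless",
          "source": "doc['timestamp'].value.toInstant().toEpochMilli() / 60000L"
        }
      },
      "aggs": {
        "path_points": {
          "geohash_grid": { "field": "location", "precision": 7 }
        }
      }
    }
  }
}
```

With `max_docs_per_value: 1`, at most one document per minute bucket enters the sample, which spreads the selection across the whole time range.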
I'd have to refactor the index a bit then; I want to avoid scripted fields to keep performance high enough.
Meanwhile I'm using the function_score query with a random function. This gives a uniform random scoring that lets me drop enough data and still get a reasonably good-quality sample.
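For reference, a sketch of that approach (same illustrative field names as above; `random_score` with a `seed` also needs a `field` to derive per-document randomness from, and `boost_mode: replace` discards the query score so only the random value ranks the hits):

```json
POST /vehicle-tracks/_search
{
  "size": 200,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            { "term": { "vehicle_id": "car-42" } },
            { "range": { "timestamp": { "gte": "now-2h" } } }
          ]
        }
      },
      "random_score": { "seed": 42, "field": "_seq_no" },
      "boost_mode": "replace"
    }
  }
}
```

Taking the top 200 hits by this random score yields an approximately uniform sample of the 7200 points; the client then re-sorts them by `timestamp` before drawing the path.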