Scripted metric aggregations & sorting

I tried to collect values with a scripted_metric aggregation and sort them by a timestamp field.

Within each aggregation bucket, the result is split into several arrays.
Each array is sorted internally, but the arrays are not sorted relative to one another (see the screenshot).

  1. Why are there many sub-arrays in the result?
  2. Why are the sub-arrays not sorted relative to one another?

Query:
"aggregations": {
  "ccpairTerm": {
    "terms": {
      "field": "ccpair"
    },
    "aggregations": {
      "timestampTerm": {
        "scripted_metric": {
          "init_script": {
            "source": "params._agg.lpSendingTime=[];params._agg.ccpairs=[];params._agg.tnetServerNames=[];params._agg.platformNames=[]",
            "lang": "painless"
          },
          "map_script": {
            "source": "params._agg.lpSendingTime.add(doc.lpSendingTime.value);params._agg.ccpairs.add(doc.ccpair.value);params._agg.tnetServerNames.add(doc.tnetServerName.value);params._agg.platformNames.add(doc.platformName.value)",
            "lang": "painless"
          },
          "combine_script": {
            "source": """
              List result;
              result = [];
              params._agg.lpSendingTime.sort((x, y) -> (int)(x.getMillis() - y.getMillis()));
              for (int i = 0 ; i < params._agg.lpSendingTime.length-1 ; i++)
              {
                if (params._agg.lpSendingTime[i + 1].getMillis() - params._agg.lpSendingTime[i].getMillis() > 10) {
                  result.add(params._agg.lpSendingTime[i].getMillis());
                }
              }
              return result;
            """,
            "lang": "painless"
          }
        }
      }
    }
  }
}

Result:
[image: screenshot of the aggregation result]


You have not defined a reduce_script in your scripted_metric aggregation, so what you are seeing is the raw result of the combine_script from each shard: one array per shard. You need to define a reduce_script that takes the results from all shards and merges them into your final result. See the following documentation for more information: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/search-aggregations-metrics-scripted-metric-aggregation.html
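To illustrate why the response contains one sorted sub-array per shard, and what a reduce step changes, here is a minimal Python simulation. It is not Elasticsearch code: the shard contents and the function names `combine` and `reduce_shards` are made up for illustration, and `combine` only mirrors the per-shard sort, not the gap filtering, of the combine_script in the question.

```python
import heapq

# Hypothetical timestamps (epoch millis) for one ccpair bucket, spread over 3 shards.
shards = [
    [1000, 1040, 1010],
    [1005, 1030],
    [1020, 1050, 1015],
]

def combine(shard_values):
    """Per-shard step: runs once per shard and only sees that shard's values."""
    return sorted(shard_values)

def reduce_shards(per_shard_results):
    """Coordinating-node step: merge the already-sorted per-shard lists into one."""
    return list(heapq.merge(*per_shard_results))

per_shard = [combine(s) for s in shards]
merged = reduce_shards(per_shard)

print(per_shard)  # what you get without a reduce step: one sorted list per shard
print(merged)     # what a reduce step can produce: a single globally sorted list
```

Without the reduce step, the per-shard lists are simply returned side by side, which matches the sub-arrays described in the question.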

Also, what is your use case for the scripted_metric aggregation here? Often there are ways of achieving what you need without scripted_metric, by using a script inside another aggregation instead.


Thanks for the fast reply!

The data is a collection of timestamps.
I want to know whether there is a gap between two consecutive timestamps. The index is split across more than one shard, so two consecutive timestamps can be on different shards.

  1. Is it possible to gather all the data and sort it afterwards?
  2. How can we do it with a script?
  3. How can we do it without a script (query only)?

If you wanted to do the analytics offline you could use the scroll API to stream all the data out of Elasticsearch and do your calculations on your client.
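As a rough sketch of the client-side calculation, assuming the timestamps have already been streamed out as epoch milliseconds: the function name `find_gap_starts`, the sample values, and the two-shard split are hypothetical, and the 10 ms threshold is taken from the combine_script in the question. In elasticsearch-py the values could come from `elasticsearch.helpers.scan`, which is not used here so that the example stays self-contained.

```python
from typing import Iterable, List

def find_gap_starts(timestamps_ms: Iterable[int], max_gap_ms: int = 10) -> List[int]:
    """Return every timestamp whose successor (in sorted order) is more than max_gap_ms later."""
    ordered = sorted(timestamps_ms)
    return [a for a, b in zip(ordered, ordered[1:]) if b - a > max_gap_ms]

# Hypothetical split of one ccpair's timestamps across two shards.
shard_a = [1000, 1012, 1031]
shard_b = [1005, 1030, 1050]

# Correct: gather everything first, then look for gaps.
global_gaps = find_gap_starts(shard_a + shard_b)

# Misleading: look for gaps on each shard separately and concatenate.
per_shard_gaps = find_gap_starts(shard_a) + find_gap_starts(shard_b)

print(global_gaps)
print(per_shard_gaps)
```

The two results differ because neighbouring timestamps can live on different shards, so gaps can only be judged after all values for a bucket have been brought together.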

The problem with your approach here is that you are potentially going to need to stream a lot of data from the shards to the coordinating node because the number of timestamps could be large. You could mitigate this in two ways:

  1. When indexing documents, use routing to route documents with the same ccpair value to the same shard - This way you are guaranteed that all timestamps for a term bucket are on the same shard. You will still need to do complex processing, though, so you will likely still need the scripted_metric aggregation.
  2. Have a secondary index where each document represents a ccpair value and contains the information about whether there is a gap and at what timestamps - This involves having a job that runs periodically, collects new data from the primary index and merges that new data into the relevant documents in the secondary index. At query time you can then run normal aggregations to obtain the information you need, provided you structure the documents in the secondary index appropriately.
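The periodic merge job from the second option could look roughly like the sketch below. It models the secondary index as a plain Python dict and assumes batches arrive in time order for each key; the summary fields `last_ts` and `gap_starts`, the function `merge_batch`, and the key "EURUSD" are all hypothetical, and the 10 ms threshold is the one from the question.

```python
from typing import Dict, List

GAP_MS = 10  # same threshold as the combine_script in the question

def merge_batch(summary: Dict[str, dict], ccpair: str, new_timestamps: List[int]) -> None:
    """Fold a batch of new timestamps into the summary document for one ccpair.

    Assumes every timestamp in the batch is later than anything merged before.
    """
    doc = summary.setdefault(ccpair, {"last_ts": None, "gap_starts": []})
    for ts in sorted(new_timestamps):
        last = doc["last_ts"]
        # Compare against the last timestamp seen so far, including previous batches.
        if last is not None and ts - last > GAP_MS:
            doc["gap_starts"].append(last)
        doc["last_ts"] = ts

summary: Dict[str, dict] = {}  # stand-in for the secondary index
merge_batch(summary, "EURUSD", [1000, 1005, 1030])
merge_batch(summary, "EURUSD", [1035, 1060])
print(summary)
```

Because each summary document carries the last timestamp seen, a gap that straddles two batches is still detected, and the query side only has to read precomputed fields.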

If your data volumes are small enough that your current approach (after you add a reduce script) works well, you can continue with it, but it is worth keeping the above in mind in case your data volume increases and performance starts to suffer.

