Faster way to count maximum cardinality?

klahnakoski · October 19, 2018, 4:27pm

I am collecting metadata on properties:

{
	"aggs":{
		"card":{"cardinality":{"field":"task.dependencies"}},
		"multi":{"max":{"script":"doc[\"task.dependencies\"].values.size()"}}
	},
	"size":0
}

The first is cardinality, which helps decide whether a terms aggregation is reasonable. The second, "multi", is the maximum number of values a single document holds for the property. These "multi-valued" properties must be treated differently from single-valued properties when aggregating.
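To make the intent concrete, here is a minimal Python sketch of how the two values might be read from the response. It assumes the usual aggregation response shape, with each metric under `aggregations.<name>.value`. The function name and the `max_terms` threshold are hypothetical, chosen only for illustration.

```python
def summarize_property(response, max_terms=1000):
    """Interpret the "card" and "multi" aggregations for one property.

    max_terms is a hypothetical cutoff for when a terms aggregation
    is still considered reasonable; tune it for your own data.
    """
    aggs = response["aggregations"]
    cardinality = aggs["card"]["value"]
    # "max" reports null (None) when no document matched
    max_values = aggs["multi"]["value"]
    return {
        "cardinality": cardinality,
        "multi_valued": max_values is not None and max_values > 1,
        "terms_ok": cardinality <= max_terms,
    }


# Hand-written example response, not real cluster output
example = {"aggregations": {"card": {"value": 42}, "multi": {"value": 3.0}}}
print(summarize_property(example))
```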

{"script":"doc[\"task.dependencies\"].values.size()"} is slow. Is there a faster way to detect whether a property has been given more than one value?

Thank you

klahnakoski · October 20, 2018, 11:41am

Since it is uncommon for single-valued properties to suddenly have multiple values, and multi-valued properties are commonly multi-valued (like text type), I need not scan all the data; I can perform this aggregation on only a recent subset. I make a new index for each week of data. I chose a property that limits my query to recent data. I assume the script is then executed on only one or two indexes, rather than all of them. This reduces the script workload, and the query runs faster.

{
    "aggs":{
        "count":{"cardinality":{"field":"task.dependencies"}},
        "_filter":{
            "aggs":{"multi":{"max":{"script":"doc[\"task.dependencies\"].values.size()"}}},
            "filter":{"bool":{"should":[
                {"term":{"etl.timestamp.~n~":1539388800}},
                {"missing":{"field":"etl.timestamp.~n~"}}
            ]}}
        }
    },
    "size":0
}
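Since the same query gets repeated for each property, it can be generated. The following is a rough Python sketch that builds the body above as a dict; `build_metadata_query` and its parameter names are hypothetical, and it only constructs the request without sending it. The `missing` filter is copied as-is from the query above; as far as I know, newer Elasticsearch versions dropped it in favour of `bool`/`must_not`/`exists`, so adjust for your version.

```python
import json


def build_metadata_query(field, timestamp_field, timestamp_value):
    """Build the metadata query: cardinality over everything,
    max value-count only over the recent subset."""
    # json.dumps quotes the field name for use inside the script
    script = "doc[%s].values.size()" % json.dumps(field)
    return {
        "aggs": {
            "count": {"cardinality": {"field": field}},
            "_filter": {
                # the slow script runs only on documents passing the filter
                "aggs": {"multi": {"max": {"script": script}}},
                "filter": {"bool": {"should": [
                    {"term": {timestamp_field: timestamp_value}},
                    # keep documents that have no timestamp at all
                    {"missing": {"field": timestamp_field}},
                ]}},
            },
        },
        "size": 0,
    }


query = build_metadata_query("task.dependencies", "etl.timestamp.~n~", 1539388800)
print(json.dumps(query, indent=4))
```

In the response, the maximum is then nested one level deeper, under `aggregations._filter.multi.value`, alongside the filter's `doc_count`.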
