Anomaly detection - partitioning data

Hi,

I've been working with the anomaly detection functionality in Kibana. I've got about 500 records processed, all of which have a transaction.name and a custom id that works as a tenant indicator (I'll call it tenantId from now on).
So I created an anomaly detection job with these detectors:
count partitionfield="transaction.name"
count partitionfield="tenntId"
and I get the data properly partitioned by all the values of these fields throughout the dataset. However, when I try to use a by field, things get weird:
count by "tenantId" partitionfield="transaction.name"
count by "transaction.name" partitionfield="tenantId"
I get the choice to select a combination of tenantId and transaction.name, but the dropdowns (in the Single Metric Viewer) are somewhat lacking in data: I can only choose one tenantId and one transaction.name. I've tried this a few times and can't get around it...
What I'm trying to achieve is to detect anomalies per tenant and per API method call, because some tenants may be much busier than others and some method calls may be used more than others.
So any ideas why I can't get a full set of possible tenantIds and transaction.names in those dropdowns?
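
For reference, the double-split detector corresponds to roughly this job configuration via the ML API (the job id, bucket span, and time field below are just placeholders, not necessarily what I actually used):

```
PUT _ml/anomaly_detectors/tenant_transaction_counts
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "count",
        "by_field_name": "tenantId",
        "partition_field_name": "transaction.name",
        "detector_description": "count by tenantId partitionfield=transaction.name"
      }
    ],
    "influencers": ["tenantId", "transaction.name"]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```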

Please report the version you are using when you post questions. This is especially relevant here because the UI behavior in this area has changed over time (see https://github.com/elastic/kibana/issues/52618 for example).

Also, adding screenshots to your posts is very helpful for us.

I'm working with version 7.9.2

Thanks for the info. Now, when you say "somewhat lacking in data", I need to know what you mean by that.

Because it is possible that each combination of transaction.name and tenantId results in sparse data just by its nature.

Are you sure you really need the double split here? Why not just a single split (perhaps partitioning on transaction.name and just leaving tenantId as an influencer)?
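
A minimal sketch of that single-split job, assuming a 15-minute bucket span and a @timestamp time field (adjust both to your data), could look like:

```
PUT _ml/anomaly_detectors/transaction_name_counts
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "count",
        "partition_field_name": "transaction.name",
        "detector_description": "count partitionfield=transaction.name"
      }
    ],
    "influencers": ["transaction.name", "tenantId"]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```

That way tenantId still shows up as an influencer on anomalies without multiplying the number of models.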

So when data is sparse, the options may not show in the dropdowns? So maybe if all the tenants generate enough data, it will show everything? Because it behaves awkwardly: when I do count by "tenantId" partitionfield="transaction.name", it returns only tenantId=2 and one transaction.name, and when I reverse it to count by "transaction.name" partitionfield="tenantId", I get tenantId=52 and a different transaction.name.

Unfortunately it's a requirement for the project I'm working on to have the double split here.

I was cautioning about sparse data with respect to the modeling. If you "oversplit" the data, you may end up in a situation where a unique combination of the by_field and the partition_field doesn't occur very frequently, leaving too few observations for adequate modeling. For example, if the ~500 documents you mentioned are spread across many unique tenantId/transaction.name combinations, each combination may only contribute a handful of observations.

In the UI, the dropdowns will only show entities that have anomaly records. For example, I just ran a contrived job of count by request.keyword partitionfield=geo.src on the sample Kibana web logs data set (there are 175+ unique request.keyword values and 160+ unique geo.src values). However, when the job ran, only 7 anomalies were found in the data set, all for the combination of geo.src:"CN" and request.keyword:"/beats/metricbeat". Therefore, the UI looks like this:

[screenshot: Single Metric Viewer dropdowns offering only geo.src "CN" and request.keyword "/beats/metricbeat"]

In other words, the dropdowns don't show any of the other values of geo.src or request.keyword.
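
If you want to check which entities actually have anomaly records (and will therefore show up in the dropdowns), you can query the job's results directly; for example (substitute your own job id):

```
GET _ml/anomaly_detectors/tenant_transaction_counts/results/records
{
  "record_score": 0,
  "sort": "record_score",
  "desc": true
}
```

Each record in the response carries partition_field_value and by_field_value, so you can see exactly which tenantId / transaction.name combinations produced anomalies.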
