I'm thinking about splitting this on the way in into site.com and prefixes, so I can easily visualise using a terms agg for 'top 50 sites' in Kibana
The cleanest way to do this would probably be to fix the data, but in Kibana what I would probably do is create two separate visualizations, one that filters for site.com and www.site.com, and one that filters for the inverse (NOT site.com AND NOT www.site.com). Then you could put them side by side in a dashboard to get a view across the entire data set.
How would that solve for 'top 50 sites'? If I were going to make 50 visualisations, I'd be better off making 50 filters and using a single viz with a filters agg.
It sounded to me like you essentially wanted a top 50 terms agg with the counts for site.com and www.site.com combined. I'm not sure how you'd achieve that with a single visualization, so my thinking was that you could create one visualization with the count for site.com and www.site.com and a second visualization with a top 50 terms agg on just the subdomains.
But perhaps I've misinterpreted your question. Could you provide a little more detail on what the data looks like? I assumed the domain was static and only the subdomains change, maybe that's incorrect?
Ah I see, that is more complicated. So for site3, www.site3.com, site3.com, and weird.site3.com should all count towards site3.com in the terms agg, is that right?
Since you already seem to be using Groovy scripting, I assume you've tried creating a scripted field that strips the subdomain? Does that not work for some reason?
[quote="Bargs, post:6, topic:47251"]
Ah I see, that is more complicated. So for site3, www.site3.com, site3.com, and weird.site3.com should all count towards site3.com in the terms agg, is that right?[/quote]
Cool idea - would you be able to give an example of how this would be done?
I imagine you could split the string on dots, remove the first element if there are more than two elements in the resulting array (in other words, there's a subdomain), and then rejoin with dots. A regex might also work, but I imagine that would be slower.
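To make the split/drop/rejoin idea concrete, here is a rough sketch of the logic in Python. It is not a Kibana scripted field or a Groovy script, just an illustration; the function name `strip_subdomain` and the sample hosts are made up for the example.

```python
from collections import Counter


def strip_subdomain(host: str) -> str:
    """Drop the first dot-separated label when the host has more than two labels."""
    parts = host.split(".")
    if len(parts) > 2:
        # More than two labels: treat the first one as the subdomain and drop it.
        parts = parts[1:]
    return ".".join(parts)


# Hypothetical sample data standing in for the hostname field.
hosts = ["www.site3.com", "site3.com", "weird.site3.com", "www.site.com", "site.com"]

# Rough equivalent of a "top 50" terms agg over the normalised value.
top_sites = Counter(strip_subdomain(h) for h in hosts).most_common(50)
```

Note that this only removes one label, so `a.b.site3.com` would become `b.site3.com`, and a host like `site.co.uk` would be cut down to `co.uk`. If the data contains cases like that, the rule would need to be more careful.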
You might also be able to accomplish this with a value script in the advanced options of the terms agg itself instead of a scripted field: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_value_script_8