Visualise '(www.|)site.com' terms agg in weblogs

I have weblogs for a bunch of sites, and the logs go into ES with the hostname as either site.com or www.site.com (and sometimes, but not often, a.site.com, b.site.com and so on).

I'm thinking about splitting the hostname on the way in into the base domain (site.com) and a prefix, so I can easily visualise 'top 50 sites' in Kibana using a terms agg.

Currently I can't figure out how to merge the totals for www.site.com and site.com in a Kibana vis. Is there a way I can do this without reindexing?

The cleanest way to do this would probably be to fix the data, but in Kibana what I would probably do is create two separate visualizations, one that filters for site.com and www.site.com, and one that filters for the inverse (NOT site.com AND NOT www.site.com). Then you could put them side by side in a dashboard to get a view across the entire data set.

How would that solve for 'top 50 sites'? If I were going to make 50 visualisations, I'd be better off making 50 filters and using a single viz with a filter agg.

It sounded to me like you essentially wanted a top 50 terms agg with the counts for site.com and www.site.com combined. I'm not sure how you'd achieve that with a single visualization, so my thinking was that you could create one visualization with the count for site.com + www.site.com and a second visualization with a top 50 terms agg on just the subdomains.

But perhaps I've misinterpreted your question. Could you provide a little more detail on what the data looks like? I assumed the domain was static and only the subdomains change, maybe that's incorrect?

site1.com
www.site1.com
site2.com


site3.com
www.site3.com
weird.site3.com
site4.com
www.site4.com
etc

Ah I see, that is more complicated. So for site3, www.site3.com, site3.com, and weird.site3.com should all count towards site3.com in the terms agg, is that right?

Since you already seem to be using Groovy scripting, I assume you've tried creating a scripted field that strips the subdomain? Does that not work for some reason?

[quote="Bargs, post:6, topic:47251"]
Ah I see, that is more complicated. So for site3, www.site3.com, site3.com, and weird.site3.com should all count towards site3.com in the terms agg, is that right?[/quote]

Exactly

Cool idea - would you be able to give an example of how this would be done?

I imagine you could split the string on dots, remove the first element if there are more than two elements (in other words, there's a subdomain), and then rejoin with dots? Or maybe a regex would work, but I imagine that would be slower.
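
To make the split-and-rejoin idea concrete, here is a rough sketch in plain Python rather than in the scripting language a scripted field would actually use, so treat it as a way to check the logic and not as something to paste into Kibana. The function name `strip_subdomain` is made up for illustration, and the hostnames are the sample values from earlier in this thread.

```python
from collections import Counter


def strip_subdomain(host: str) -> str:
    """Drop the first label when the hostname has more than two labels."""
    parts = host.split(".")
    if len(parts) > 2:
        # More than two labels means there is a subdomain: remove the first one.
        parts = parts[1:]
    return ".".join(parts)


# Sample hostnames taken from the thread.
hosts = [
    "site1.com", "www.site1.com",
    "site2.com",
    "site3.com", "www.site3.com", "weird.site3.com",
    "site4.com", "www.site4.com",
]

# Roughly what a 'top 50' terms agg on the stripped value would count.
top_sites = Counter(strip_subdomain(h) for h in hosts).most_common(50)

for site, count in top_sites:
    print(site, count)
```

Two caveats with this simple rule: a bare hostname under a multi-part suffix such as site.co.uk would be reduced to co.uk, and a nested subdomain such as a.b.site.com would only lose its first label. If your logs contain either case, the rule needs to be more careful.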

You might also be able to accomplish this with a value script in the advanced options of the terms agg itself instead of a scripted field: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_value_script_8