I have weblogs for a bunch of sites, and the logs go into ES with either site.com or www.site.com (and sometimes, but not often, a.site.com, b.site.com and so on).
I'm thinking about splitting this on the way in into site.com and prefixes, so I can easily visualise using a terms agg for 'top 50 sites' in Kibana
Currently I can't figure out how to merge the totals for www.site.com and site.com in a Kibana vis - is there a way I can do this without reindexing?
The cleanest way to do this would probably be to fix the data, but in Kibana what I would probably do is create two separate visualizations, one that filters for site.com and www.site.com, and one that filters for the inverse (NOT site.com AND NOT www.site.com). Then you could put them side by side in a dashboard to get a view across the entire data set.
How would that solve for 'top 50 sites'? If I were going to make 50 visualisations, I'd be better off making 50 filters and using a single viz with a filter agg.
It sounded to me like you essentially wanted a top 50 terms agg with the counts for site.com and www.site.com combined. I'm not sure how you'd achieve that with a single visualization, so my thinking was that you could create one visualization with the count for site.com + www.site.com and a second visualization with a top 50 terms agg on just the subdomains.
But perhaps I've misinterpreted your question. Could you provide a little more detail on what the data looks like? I assumed the domain was static and only the subdomains change. Is that incorrect?
Ah I see, that is more complicated. So for site3, www.site3.com, site3.com, and weird.site3.com should all count towards site3.com in the terms agg, is that right?
Since you already seem to be using Groovy scripting, I assume you've tried creating a scripted field that strips the subdomain? Does that not work for some reason?
[quote="Bargs, post:6, topic:47251"]
Ah I see, that is more complicated. So for site3, www.site3.com, site3.com, and weird.site3.com should all count towards site3.com in the terms agg, is that right?[/quote]
Exactly
Cool idea - would you be able to give an example of how this would be done?
I imagine you could split the string on dots, remove the first element if there are greater than 2 array elements (in other words, there's a subdomain), and then rejoin with dots? Or maybe a regex would work, but I imagine that would be slower.
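To make the split-and-rejoin idea concrete, here is a minimal sketch of the logic in plain Python. It is not a Kibana scripted field, only an illustration of the string handling, and the function name `strip_subdomain` and the sample hostnames are made up for the example. Note that it drops only the first label, so a.b.site.com becomes b.site.com, and it would do the wrong thing for suffixes like site.co.uk.

```python
from collections import Counter


def strip_subdomain(host):
    """Drop the first dot-separated label if the host has more than two labels."""
    parts = host.split(".")
    if len(parts) > 2:
        # More than two labels: treat the first one as a subdomain and drop it.
        parts = parts[1:]
    return ".".join(parts)


# Hypothetical sample data standing in for the hostname field in the logs.
hosts = ["www.site.com", "site.com", "weird.site.com", "other.org"]

# Count hits per normalized host, roughly what a terms agg would then see.
counts = Counter(strip_subdomain(h) for h in hosts)
print(counts.most_common(50))  # [('site.com', 3), ('other.org', 1)]
```

The same steps should translate fairly directly into a scripted field or an ingest-time transform, though the exact syntax there would depend on the scripting language and version in use.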