Percentage repeats year over year

I am relatively new to ELK. Unfortunately I'm a sysadmin and definitely not an analytics person. This means that while I've been able to setup the ELK stack and even injest and work with data, I am pretty novice at doing anything robust with visualizations.

We have a situation where I think ELK is more than adequate for us and our customers. Unfortunately we, as a small company, have done everything manually in the past so many think that it is impossible to automate these analytic tasks. I'm hoping to learn enough to show our team that, with the proper tools, we can do things much more effeciently.

Enough back story and on to the task. I have injested roughly 8 years worth of project data from our CRM. I have mapped the @timestamp properly and the data looks good in Discover. I've even created some basic aggregation charts of top clients, etc.

The first chart the execs want to see is a chart that shows what percentage of clients which we worked for in the previous year did we also do in the present year. So basically I need a chart that looks at the client names from (year-1y) and matches those to the client names from (year) and ignores the new ones and gives a percentage of the ones from year-1y that are repeats.

Can someone give me some guidance on where to start on something like this? I've done, and will continue doing, some searching on my own but if someone out there can give me some useful info to start with, I would greatly appreciate it.

I was able to create a vertical chart with yearly unique customers. Now I just need to do some kind of percentage match with the previous year.

Post

It looks like some have been looking at something similar back in 2015. GordonM phrases it like "this month compared to this month last year (or day, year, week, etc.)"

It seems that some have been able to do similar in timelion so I am refocusing my efforts to that. Still interested in guidance though if someone's willing to help.

Hey @stevezemlicka, Timelion will allow you to compare one month to the previous month, and perform a calculation on the data. You can do the same using other Visualizations in Kibana which rely upon the serial differencing aggregration, or the derivative aggregation. However, these options let you perform math on the "sibling buckets", they won't let you determine whether a specific field from a document was in a sibling bucket...

How are you ingesting your data into Elasticsearch? Is it possible to augment your ingesting pipeline to include an additional field specifying whether or not a client is recurring, or it's the first time a client is seen?

We performed a one-time injest of a CSV for this proof of concept. Unfortunately the native data does not have this info which is why we are trying to figure out a system to get us these metrics. Both serial differencing aggregation and derivative aggregation are above my head ATM so I need to some research into those.

It would seem I could do this with some maths. Say something like number of unique documents for previous and current year minus number of uniq documents for previous year and take that value and subtract it from the number of unique documents for this year. My challenge with putting that into timelion is how do I get the cardinality of the two years together? If I sum the cardinality from the two years, this is not accurate since some are duplicated.

This is sort of what I want to do:
.es(index=*projects, metric='cardinality:Client Name.keyword').subtract(.es(index=*projects, metric='cardinality:Client Name.keyword') AND (.es(index=*projects, metric='cardinality:Client Name.keyword', offset=-1y)).subtract(.es(index=*projects, metric='cardinality:Client Name.keyword', offset=-1y))).bars(15)

[Correction]
.es(index=*projects, metric='cardinality:Client Name.keyword').subtract(.es(index=*projects, metric='cardinality:Client Name.keyword') AND (.es(index=*projects, metric='cardinality:Client Name.keyword', offset=-1y)).subtract(.es(index=*projects, metric='cardinality:Client Name.keyword', offset=-1y))).divide(.es(index=*projects, metric='cardinality:Client Name.keyword', offset=-1y)).bars(15)

The problem is the AND doesn't work between queries and .sum produces inaccurate results (due to duplicates of unique documents):
.es(index=*projects, metric='cardinality:Client Name.keyword') AND (.es(index=*projects, metric='cardinality:Client Name.keyword', offset=-1y))

I also need to change the results to a percentage and the scale to 100 but I imagine that won't be so bad once I get the main calculations done.

This also does not work:
.es(index=*projects, metric='cardinality:Client Name.keyword').subtract(.es(index=*projects, metric='cardinality:Client Name.keyword' AND metric='cardinality:Client Name.keyword', offset=-1y)).subtract(.es(index=*projects, metric='cardinality:Client Name.keyword', offset=-1y))).divide(.es(index=*projects, metric='cardinality:Client Name.keyword', offset=-1y)).bars(15)

Another way to put this, is there any way to include both an index and the offset of an index in a single query?

https://discuss.elastic.co/t/how-to-create-time-range-conditions-on-timelion/119404

This looked promising but it seems that Lucene can't do relative dates. Is there a different visualization that would be better?

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-cumulative-cardinality-aggregation.html

Is it possible to filter or otherwise carve out the years into buckets and then use cumulative cardinality to run a calculation that takes into account only two specific buckets?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.