Challenges with date/time range selection when combining snapshot and timeseries data

Hi there, my organization is very new to Elasticsearch and Kibana. Historically we were using SQL Server for our analytics. We are doing a PoC to determine if Elastic and Kibana will work for our scenario.

We have 12 months of historical snapshot data in SQL Server that we are importing into Elasticsearch. The existing data model is quirky: in order to keep track of historic changes, we have a daily snapshot of users' activities.

We want to do a few point-in-time metrics and some over-time analytics. Let's say the point-in-time metric I want to display in a tile is the count of unique user actions, and my over-time metric is user activity over time.
Because the index has a snapshot for every day of the year, if my time period says "Last week", my over-time activity looks good, but my metric is 7 times larger than it should be, because it is looking at data for the whole week.

I want data for just the last day of the week (the end date of the date/time filter) for the metric tile, and data for the whole week for the over-time analytics. I am not able to figure out how to do that in one dashboard.

Any help would be much appreciated!
Thanks

Hi,

I have some questions.

"a daily snapshot of users' activities."

It is a bit difficult for me to understand. Could you share a small example? I'm not sure what "a daily snapshot of users' activities" means. What is the reason you don't store the activity logs themselves?

but my metric is 7 times larger than it should be, because it is looking at data for the whole week.

Why does the data from Monday (for example) show up again on other days? Do you store the latest activities for each user as a daily snapshot?

Finally, is there any room to build a new data model that suits Elasticsearch?

I am so glad you reached out. I will try my best to explain the current state. I am open to ideas on building a data model that fits Elastic and Kibana; I am a little stumped on how to do that.

Let me describe it in terms of an example:

My system is a skill management system. For example, let's say I am an engineering employee in an organization, assigned the roles of software engineer and data engineer. Based on the company's business process, I will be assessed on my skills. My data structure will look like the following.

Answers I want from this data:
Is the user compliant for all assigned roles? If you set the time range to year 2021, Jane is compliant with expectations. However, if you set the time range to the last 6 months, she is no longer compliant, because she has a new role assigned and her assessments don't meet expectations.

Is the user compliant for the software engineering role? Yes for years 2021 and 2022.
Is the user compliant for the data engineering role? N/A for year 2021 and no for 2022.
Assessments conducted over time? If you set the time range to year 2021, total completed assessments are 4. If you set the time range to year 2022, total completed assessments are 10 (inclusive of the assessments completed in 2021), because once an assessment is done we count it as completed unless it is unassigned or expires.

Some gotchas are:

An assessment doesn't take place daily, but once it takes place, the status of compliant or not compliant stays with the user/skill combination until the next one takes place. That means if I check the status on day 1 and the user is compliant, then a month later, if no other assessment took place, the user is still compliant. But in Kibana, if my date range changes, how do I ensure the user is still calculated to be compliant?

At a single point in time, let's say I have 6 skills in the system, and I want to see how many skills are assessed and how many still need to be assessed. How do I define that? When I look at a specific date range, if the assessments took place before the date range they will not be counted; if I take a snapshot for each day, then they are counted multiple times :confounded:

Once again, thank you so much for your help. If you have any direction for me, please let me know.

Because the index has a snapshot for every day of the year, if my time period says "Last week", my over-time activity looks good, but my metric is 7 times larger than it should be, because it is looking at data for the whole week.

I would check out Transforms - there's a Latest transform that sounds like it could help
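For illustration, a minimal latest transform might look like this (a sketch; the index names, the `user.keyword` key, and the `snapshot_date` sort field are assumptions, not your actual schema):

```json
PUT _transform/latest-user-state
{
  "source": { "index": "user-activity" },
  "dest":   { "index": "user-activity-latest" },
  "latest": {
    "unique_key": [ "user.keyword" ],
    "sort": "snapshot_date"
  }
}
```

The destination index then keeps only the most recent document per unique key, so a metric tile pointed at it would count each user once instead of once per daily snapshot.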

Hi there, thank you for your response. I have been studying the transform feature since you mentioned it. I am a little confused about the licensing aspect. We are still in the process of testing Elastic and Kibana for our use, and we have a Basic license for our cluster. It seems like transforms are part of X-Pack, which confuses me. Are transforms included as part of the Basic license?

Thanks

As of Elastic 7.3, transforms are enabled by default, and if you don't add any specific node roles in your elasticsearch.yml, then all nodes are transform nodes by default.

I am able to get the latest records for my data set. Next question: let's say I have data from the beginning of time (the last 8 years). Using the latest transform, I can see the current state if my start date is set to 8 years ago and my end date is set to now. This is awesome and I love it. Is there a way to see an over-time trend?

So in my example above, as different roles and assessments were done for an employee, the employee's compliance changed over time.

If I were using my latest transform and wanted to see the compliance change over the last 12 months, I could run the following 12 queries:
1 - 8 years ago to 12 months ago
2 - 8 years ago to 11 months ago
3 - 8 years ago to 10 months ago
.
.
12- 8 years ago to now

This way I will get the data I am looking for, but how do I visualize all 12 months over time?
I tried a date histogram, but that didn't work, as it was just breaking the data view on the time field. I want multiple time ranges, where each time range starts from the beginning of time.
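To illustrate, the kind of overlapping windows I mean could at least be expressed in one request with a date_range aggregation, since its ranges are allowed to overlap (index and field names here are just placeholders):

```json
GET index-raw/_search
{
  "size": 0,
  "aggs": {
    "windows": {
      "date_range": {
        "field": "assessment_date",
        "ranges": [
          { "to": "now-12M/M" },
          { "to": "now-11M/M" },
          { "to": "now/M" }
        ]
      }
    }
  }
}
```

But this alone only counts documents per window; it doesn't reproduce the per-user "latest" logic from the transform.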

Any help and direction is much appreciated.
Thank you so much!

Hi, thank you for clarifying the situation. Then you need a relation between the two tables. Elasticsearch doesn't support JOINs in the way an RDB does natively.

In such cases there are often solutions, but they are sometimes too specific to an individual case to discuss in a forum like this. For example, it depends on whether there is "unassignment" of a skill, or repetition of assignment/unassignment for the same skill. Other things to consider are what kind of visualization you need, the size of the data, the desired performance of the queries, etc.

For now, Kibana is a good tool to visualize the aggregated results of flat documents. If you have daily data, you may visualize the aggregation of them.

P.S. A latest transform is a good way to query the latest condition, but it does not suit queries on a past period such as "for year 2021" or "for year 2022".

Thank you for the response. I am able to get away without JOINs by nesting the necessary data within the JSON object.
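For example, a document in my index now looks roughly like this (simplified, with illustrative field names):

```json
POST skills/_doc
{
  "user": "Jane",
  "role": "software engineer",
  "skills": [
    { "skill": "Java", "current_rating": 3, "required_rating": 3 },
    { "skill": "sql",  "current_rating": 3, "required_rating": 4 }
  ]
}
```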

My current challenge is getting the "latest" dataset over set intervals, so I can do an over-time visualization of how the cumulative metrics are changing over time.

Month 1 "latest data" user's average is 90%
Month 2 "latest data" user's average is 70%
Month 3 "latest data" user's average is 95%

Because in my data set, at any given point in time, the past matters, it would be very helpful to provide that visualization. I don't want to create historic snapshots; I feel like I already have the history, so I should be able to do the analysis.

Any recommendations for that?

Though I'm not sure it is the best way and it is just an idea, there is a cumulative sum aggregation in Elasticsearch. If you store not the assessment score itself but the difference from the previous assessment score, you will be able to get the latest score at regular time points with a date_histogram aggregation and a cumulative sum aggregation.
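As a sketch of that idea (the index name and the hypothetical `score_delta` field, holding the difference from the previous assessment score, are assumptions):

```json
GET assessments/_search
{
  "size": 0,
  "aggs": {
    "per_month": {
      "date_histogram": {
        "field": "assessment_date",
        "calendar_interval": "month"
      },
      "aggs": {
        "delta_sum": { "sum": { "field": "score_delta" } },
        "running_score": {
          "cumulative_sum": { "buckets_path": "delta_sum" }
        }
      }
    }
  }
}
```

Each monthly bucket's `running_score` is then the sum of all deltas up to that point, i.e. the score as of that month.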

I tried another way, without success:

Using "gap_policy": "keep_values" of a bucket_script aggregation on a date_histogram aggregation, over the latest score calculated by a top_metrics aggregation or a scripted_metric aggregation, but I just got aggregation_execution_exception because:

buckets_path must reference either a number value or a single value numeric metric aggregation, got: [InternalTopMetrics] at aggregation [latest_score_in_buckets]

buckets_path must reference either a number value or a single value numeric metric aggregation, got: [InternalScriptedMetric] at aggregation [latest_score_in_buckets]

I have been investigating Rollup job for my scenario.

I was thinking of combining these pieces:

  • ETL to bring in the raw data, each assessment -> index-raw
  • use a continuous latest transform to get the latest assessments -> index-latest
  • use a rollup job to summarize by role and user -> index-rolled-up
  • use a pivot transform on the rolled-up index to pull role information for each user -> index-pivot

My question is: can I create a rollup job where each run results in a new index?

Sorry, I'm not familiar with rollup jobs...

But I understand that it is designed to "roll up" old data to less granularity for coarser-scale aggregation, not to create another type of data. It is something like compression, specialized for certain supported aggregations. Therefore, any aggregation that works on a rolled-up index should also work on the original index.

So a Rollup job won't work in my scenario. You are correct that its primary purpose is to create less granular data. There are two issues: one, a date histogram field is required, and two, the only aggregations it allows are sum/avg/max/min/value_count, whereas in my scenario none of those aggregations makes sense. Let's say you are a software engineer who has data modelling in your skillset.

day 1 - you could be assessed as expert
day 45 - you could be assessed as basic

"Latest" as of day 15 works, and I want that state to be remembered, so that if I query 6-month-old data, your assessment is accurately represented.

I think I am going back to storing daily snapshots of data...

I have a question about date querying.

Let's say I store daily snapshots. Is there a way for me to query only the last-day-of-month data within the last year?

My date range would say gte: 5/24/2021 and lte: 5/24/2022, but instead of returning everything, it would give me only the documents for the last day of each month:

  • 5/31/2021
  • 6/30/2021
  • 7/31/2021
  • 8/31/2021
    .
    .
  • 5/24/2022
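(To make the idea concrete, I imagine something like this script query; the index name and the `day` field name are placeholders:)

```json
GET snapshots/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "day": { "gte": "2021-05-24", "lte": "2022-05-24" } } },
        {
          "script": {
            "script": {
              "lang": "painless",
              "source": "def d = doc['day'].value; return d.getDayOfMonth() == d.toLocalDate().lengthOfMonth();"
            }
          }
        }
      ]
    }
  }
}
```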

I decided to solve this problem another way: I added a flag on each record, set to true for the ones that fall on the last day of the month. I use Talend to import my data, and it is a lot easier to do that logic in Java.

Next issue: the final data model I decided on to help build my visualizations is a daily snapshot. I get a record per user per assigned skill. The following are the questions I want to answer with this data:

  • Count of users that are > 75% compliant for a given day
  • Count of users that are < 50% compliant for a given day

Compliance is a calculated number based on "is meeting expectations" (a simple average).

Getting to a data table of users with compliance % is easy, but I am not sure how to do the next step, where I get a count of the users that are > 75%.

Can someone please help?

Sample data

I suppose there are two options:

  1. An average aggregation on user terms and a day date_histogram aggregation, then use a bucket selector aggregation.
  2. Use a transform with an average aggregation on user terms and a day date_histogram aggregation; then you can use any default query on the destination index.

Both are worth trying.

Tomo_M, first of all, thank you so much for continuing to help me! Our organization is new to this journey and I really appreciate your help. Based on your feedback, I am leaning towards option 1: as I have 70 million documents a month, it may be cheaper on storage cost.

So I think I am partway there. I have 2 questions now.
First question: I can't see a way to do a bucket selector in Kibana; I see avg, max, min, and sum bucket, but not a bucket selector.

Second question: in my query I have the list of users that meet the bucket selector criteria; now all I need is the count of those users so I can display it in Kibana. How can I do that?

Once again, Thank you so much!

In that case, I recommend option 2, because option 1 may significantly exceed the default max_buckets limitation. You can increase that setting, but I'm not sure it works with millions of buckets. You would also have to increase the size parameter of the terms aggregation to cover all users.

To visualize the historical change of the compliant-user ratio, you need a 2-stage aggregation (using an average aggregation instead of a bucket_selector aggregation), which is not supported in most parts of Kibana, as you said.

The following is my example of how to calculate the average/count of users with an over-75% compliance ratio (just as an experiment).

PUT test_aggregation

POST test_aggregation/_doc
{
  "day": "2022-05-01",
  "user": "Jane",
  "skill role": "SE",
  "skill": "Java",
  "assessment_date": "2021-04-30",
  "current_rating": 3,
  "required_rating": 3,
  "is_meeting_expectation": true
}

POST test_aggregation/_doc
{
  "day": "2022-05-01",
  "user": "Jane",
  "skill role": "SE",
  "skill": "python",
  "assessment_date": "2021-04-30",
  "current_rating": 3,
  "required_rating": 3,
  "is_meeting_expectation": true
}

POST test_aggregation/_doc
{
  "day": "2022-05-01",
  "user": "Jane",
  "skill role": "SE",
  "skill": "sql",
  "assessment_date": "2021-04-30",
  "current_rating": 3,
  "required_rating": 4,
  "is_meeting_expectation": false
}

POST test_aggregation/_doc
{
  "day": "2022-05-02",
  "user": "Jane",
  "skill role": "SE",
  "skill": "Java",
  "assessment_date": "2021-04-30",
  "current_rating": 3,
  "required_rating": 3,
  "is_meeting_expectation": true
}

POST test_aggregation/_doc
{
  "day": "2022-05-02",
  "user": "Jane",
  "skill role": "SE",
  "skill": "python",
  "assessment_date": "2021-04-30",
  "current_rating": 3,
  "required_rating": 3,
  "is_meeting_expectation": true
}

POST test_aggregation/_doc
{
  "day": "2022-05-02",
  "user": "Jane",
  "skill role": "SE",
  "skill": "sql",
  "assessment_date": "2022-05-02",
  "current_rating": 4,
  "required_rating": 4,
  "is_meeting_expectation": true
}

POST test_aggregation/_doc
{
  "day": "2022-05-01",
  "user": "John",
  "skill role": "SE",
  "skill": "Java",
  "assessment_date": "2021-04-30",
  "current_rating": 3,
  "required_rating": 3,
  "is_meeting_expectation": true
}

POST test_aggregation/_doc
{
  "day": "2022-05-01",
  "user": "John",
  "skill role": "SE",
  "skill": "python",
  "assessment_date": "2021-04-30",
  "current_rating": 2,
  "required_rating": 3,
  "is_meeting_expectation": false
}

POST test_aggregation/_doc
{
  "day": "2022-05-01",
  "user": "John",
  "skill role": "SE",
  "skill": "sql",
  "assessment_date": "2021-04-30",
  "current_rating": 3,
  "required_rating": 4,
  "is_meeting_expectation": false
}

POST test_aggregation/_doc
{
  "day": "2022-05-02",
  "user": "John",
  "skill role": "SE",
  "skill": "Java",
  "assessment_date": "2021-04-30",
  "current_rating": 3,
  "required_rating": 3,
  "is_meeting_expectation": true
}

POST test_aggregation/_doc
{
  "day": "2022-05-02",
  "user": "John",
  "skill role": "SE",
  "skill": "python",
  "assessment_date": "2021-04-30",
  "current_rating": 2,
  "required_rating": 3,
  "is_meeting_expectation": false
}

POST test_aggregation/_doc
{
  "day": "2022-05-02",
  "user": "John",
  "skill role": "SE",
  "skill": "sql",
  "assessment_date": "2022-05-02",
  "current_rating": 2,
  "required_rating": 4,
  "is_meeting_expectation": false
}

GET test_aggregation/_search
{
  "size":0,
  "aggs":{
    "date": {
      "date_histogram": {
        "field": "day",
        "calendar_interval": "day"
      },
      "aggs": {
        "user":{
          "terms":{
            "field": "user.keyword",
            "size": 10
          },
          "aggs": {
            "compliance_ratio": {
              "avg": {
                "field": "is_meeting_expectation"
              }
            },
            "75%":{
              "bucket_script": {
                "buckets_path": {"var": "compliance_ratio"},
                "script": "params.var > 0.75?1:0;"
              }
            }
          }
        },
        "average_compliance_ratio":{
          "avg_bucket": {
            "buckets_path": "user>compliance_ratio"
          }
        },
        "75%_ratio":{
          "avg_bucket": {
            "buckets_path": "user>75%"
          }
        },
        "75%_count":{
          "sum_bucket": {
            "buckets_path": "user>75%"
          }
        }
      }
    }
  }
}

# what will happen if it exceeds the "size" limit of the terms agg.
GET test_aggregation/_search
{
  "size":0,
  "aggs":{
    "date": {
      "date_histogram": {
        "field": "day",
        "calendar_interval": "day"
      },
      "aggs": {
        "user":{
          "terms":{
            "field": "user.keyword",
            "size": 1
          },
          "aggs": {
            "compliance_ratio": {
              "avg": {
                "field": "is_meeting_expectation"
              }
            },
            "75%":{
              "bucket_script": {
                "buckets_path": {"var": "compliance_ratio"},
                "script": "params.var > 0.75?1:0;"
              }
            }
          }
        },
        "average_compliance_ratio":{
          "avg_bucket": {
            "buckets_path": "user>compliance_ratio"
          }
        },
        "75%_ratio":{
          "avg_bucket": {
            "buckets_path": "user>75%"
          }
        },
        "75%_count":{
          "sum_bucket": {
            "buckets_path": "user>75%"
          }
        }
      }
    }
  }
}

With transform (option 2),

PUT test_aggregation_transform
{
  "mappings": {
    "properties": {
      "user":{"type":"keyword"},
      "date":{"type":"date"},
      "compliance_ratio":{"type":"float"}
    }
  }
}

PUT _transform/test_aggregation_transform
{
  "source": {
    "index": "test_aggregation"
  },
  "dest":{
    "index": "test_aggregation_transform"
  },
  "pivot": {
    "aggregations": {
      "compliance_ratio": {
        "avg": {
          "field": "is_meeting_expectation"
        }
      }
    },
    "group_by": {
      "date": {
        "date_histogram": {
          "field": "day",
          "calendar_interval": "day"
        }
      },
      "user":{
        "terms": {
          "field": "user.keyword"
        }
      }
    }
  }
}

POST _transform/test_aggregation_transform/_start

# check if the first transform completed
GET _transform/test_aggregation_transform/_stats

GET /test_aggregation_transform/_search

You can visualize the historical change of the compliance ratio across users on the destination index, for example using a formula in Lens.

Keep in mind to use a customized time interval on the horizontal axis.
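For example, on the destination index a Lens formula along these lines (using the `user` and `compliance_ratio` fields from the transform's mapping) could chart the share of users over the 75% threshold per interval:

```text
unique_count(user, kql='compliance_ratio > 0.75') / unique_count(user)
```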

Or you can create the 75% compliance flag within the transform itself.

POST _transform/_preview
{
  "source": {
    "index": "test_aggregation"
  },
  "dest":{
    "index": "test_aggregation_transform"
  },
  "pivot": {
    "aggregations": {
      "compliance_ratio": {
        "avg": {
          "field": "is_meeting_expectation"
        }
      },
      "75%":{
        "bucket_script": {
          "buckets_path": {"var": "compliance_ratio"},
          "script": "params.var > 0.75?1:0;"
        }
      }
    },
    "group_by": {
      "date": {
        "date_histogram": {
          "field": "day",
          "calendar_interval": "day"
        }
      },
      "user":{
        "terms": {
          "field": "user.keyword"
        }
      }
    }
  }
}

Hi Tomo!
I want to thank you for your help. I followed your recommendations and now have a data model that fits Elasticsearch and Kibana's capabilities. I now have 3 snapshots of our data model that summarize/aggregate the data at the granularity at which we want to provide visualizations.

My last challenge is filtering. We have created a filters object and want to disable the default Kibana search bar so our data model is not exposed to end users.

Now we have visualizations on the same dashboard that are pulled from 3 different indexes, but the filters are tied to 1 index, and that is causing an issue.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.