Group by only for the latest date in pie


(Andy Fan) #1

I have the following data.
|date |cluster_name|database_name|table_name|table_size
|2016-05-12|X|DB1|table1|100
|2016-05-12|X|DB1|table2|200
|2016-05-12|X|DB1|table3|700
|2016-05-13|X|DB1|table1|100
|2016-05-13|X|DB1|table2|200
|2016-05-13|X|DB1|table3|700

given some filter like cluster_name='X' and database_name='DB1'. Ican generate a beautiful report for the data size in history with line chart.

but what I want here is pie chart to show something like this:
TABLE1: 100G(10%)
TABLE2: 200G(20%)
TABLE3: 300G(70%)
this data is from the latest data.

but the issue here is if I choose time range 1 day, I can get the correct summary table.
TABLE1: 100G(10%)
TABLE2: 200G(20%)
TABLE3: 300G(70%)

but I can't get the history size trendecy.

but If I choose time range as 2 days, I can't figure out a way to show the latest status correctly.


(Tyler Smalley) #2

Hopefully, I understand what you are wanting to accomplish. You want the visualization on the dashboard to always display the current (last 24 hours) regardless of if the timepicker covering a larger period?

I was able to accomplish this by the following:

  • On Discover, add the desired filter cluster_name='X' and database_name='DB1'.
  • Select the date facet, and add one of the items to the filter.
  • Edit the date filter, you will then have a JSON representation of the filter. You can then replace it with a range:
{
  "query": {
    "range": {
      "date": {
        "gt": "now-1d"
      }
    }
  }
}

  • You can then save the search, create a visualization based on it and add it to your dashboard.

Hopefully this is what you were looking for.


(Andy Fan) #3

Thanks you. You understand me correctly.

I followed your instructions until step 4, (create a visualization based on it). I run into some new issues I didn't expect before.

I want to create a "pie", which will show the biggest table by data size. what I can do is
"Aggregation Sum(table_size)"
"split Slices" -> "table_name"
"filter by ": cluster='cluster' and table='table_name'

The issue is "we may not load the data once per day exactly", sometimes, it may more than once(like for some test?) or less than once(means not data in last 24 hours) due to some error which happens recently.

But I want the dashboard always show the latest data.
for example: If we load data twice on 2015-05-12, one is one 1:00 pm, the other one is 2:00 pm. I want to only care about the data in 2:00 pm.

and if we loss the data in the recent data, we will use the latest data in the previous day.

is it possible?

I will try my best to keep the data once per day, but it would be great that dashboard can handle such case also.

Thanks


(Tyler Smalley) #4

The only way I could see this being possible is if you were indexing the difference. Instead of indexing the size at a given time you would index the change since the previous indexing. Having this would allow you to simply SUM the size field. It would also be easier for capacity planning as you would have a cumulative graph.


(Andy Fan) #5

this will make the writer a bit of harder:)

maybe the another solution is I store the latest data into a another index. in dashboard, I just show the latest information in the other index and show the history data in the original index. actually I didn't like this one also..


(Tyler Smalley) #6

You could use an alias for your stats which points to the most recent index.

When your script inserts the data, it does so into a new index. After it's done inserting you can update the alias to point to the new index.


(Andy Fan) #7

this is a really cool suggestion and exactly resolved my uncomfortable about the this solution! I will try to do this and give a final update here. suppose everything should works pretty well now:)


(Tyler Smalley) #8

Great to hear, keep us updated!


(Andy Fan) #9

Sorry for the late update. No more unexpected things happens and the task has been completed.

The main steps includes:

  1. During a data collecting, I store the exact same data into 2 indexes. One is a new index, one is a old index which contains all the history data. The new index only contains the latest one time data.
  2. change a 'xxxsummary' index alias to the new index
  3. create pie chart to show "who is the top 5 biggest table CURRENTLY" with the xxxsummary index.
  4. at the same time, we create history tendency with the history index,
    Y Aliax: sum(table_size)
    X Alias: Terms + date + ASC + size 10000(a enough big value)
    Split line: table_name

(system) #10