Troubleshooting a data table with aggregations in Kibana

Hello,

I'm quite new to Kibana and ES, so my question might be a bit naive, I apologize in advance!

I have indexed data on production test lines and I would like to compute (and then use) some specific values.
Quick description of the data:

  • a tested product is identified with a serial
  • if the first test fails, this product can be tested on different test lines, up to 4 times (after which it is considered faulty)
  • all tests are timestamped (obviously)

Let's begin with two basic needs:

  1. I am able to count the number of products (a rough sketch of how I count them is just below this list), but I would like to break this count down by test line while keeping only the first test for each product (so that the sum of tested products across test lines equals the total number of products)
  2. I was able to display the number of tests carried out on each product, but now I would like to display the number of products with only 1 test (which corresponds to the "first pass yield" products), ideally broken down by test lines
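For context, counting the products boils down to a unique count on the serial; in Dev Tools terms it is something like this (index and field names come from my own index, I can share the full mapping if needed):

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "distinct_products": {
      "cardinality": { "field": "metadata.product_serial.keyword" }
    }
  }
}

What I don't know is how to restrict that count to the first test of each product and break it down by test line.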

I guess the two questions both come from my relative ignorance about nesting aggregation, so any help would be greatly appreciated for me to overcome this initial obstacle :slight_smile:

You don't have to apologize for naive questions. I guess it's perfectly normal for any beginner in any field.

What you should apologize for, is not giving us more details about your use case :rofl:

Joking aside, what is the structure of your documents? Can you paste a sample JSON of one of your docs here (maybe without all the fields that aren't useful for this thread's purpose)?
Also, can you post here an image (even a sketch) of what you'd like to achieve with the data table?

Thank you and welcome to the community!

OK, not sure how to post JSON in a clean way but here it is:

 {
  "mapping": {
    "properties": {
      "metadata": {
        "properties": {
          "equipment_code": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          },
          "global_status": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          },
          "product_serial": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          },
          "product_variant": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          },
          "seq_duration": {
            "type": "float"
          },
          "tstamp": {
            "type": "date"
          },
          "week": {
            "type": "long"
          }
        }
      },
      "num_value_real": {
        "type": "float"
      },
      "str_value_real": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "subtest_duration": {
        "type": "float"
      },
      "subtest_id": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "subtest_status": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "test_type": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Let me further explain the structure:

  • each file of my corpus corresponds to a global test of a given product on a given test line
  • each tested product has a unique serial (product_serial)
  • each tested product belongs to a given variant (product_variant)
  • each test line has a unique id (equipment_code)
  • each test is timestamped, has a duration (seq_duration) and an outcome (global_status), which depends on the outcome of all subtests (subtest_id, subtest_status)
  • each subtest has a type (test_type), a value (num_value_real or str_value_real according to its type), a duration (subtest_duration) and an outcome (subtest_status)

For my previous questions, the subtest part is not relevant.

I will further explain what I am trying to achieve ASAP!

Sorry I didn't mean the mapping (which can be useful anyway). I meant a document.

Just make a simple _search call (GET your_index/_search) in the Dev Tools and paste here one of the returned documents on the right.

Also, you can go to the Discover section, select your index pattern, highlight the fields you're interested in (in the column on the left, move the cursor over a field and click on the Add label appearing next to it, so as to add it as a column in the main section) and take a screenshot of the main part, so we can have an idea of how your data are structured.

Here is a discover screenshot of my data:

[Discover screenshot]

For your information, the 2 different statuses are either 'B' (for OK) or 'M' (for NOK); I will change this to a boolean later. Here, you can see that the global test fails because the subtest P000507 has failed.

Now, for the intended visuals: in both use cases, I imagine a vertical bar chart in which each bar corresponds to a week of the year and is split into several colored segments:

  1. Use case 1: each bar represents the number of products which were FIRST tested during that week (the colors of the bar split that count by test line)
  2. Use case 2: for a given test line, each bar represents the number of products which were FIRST tested on that line during that week, with two colors corresponding to the pass / fail global status

I insist on the "FIRST", because otherwise I would be counting too many products (namely the ones which failed at least once and were retested on a different test line)

Hope I've been more accurate in my explanations than before :stuck_out_tongue:

Just make a simple _search call ( GET your_index/_search ) in the Dev Tools and paste here one of the returned documents on the right.

I have to insist on getting a subset of your data so I can run a test and see if I can reproduce what you're trying to achieve. Put in fake data if the real data is sensitive, but provide me with something I can use to easily replicate your environment.

A ready-to-use bulk API request would be perfect as a starting point.

Also, why do all the docs have the same timestamp and how can you identify the first one if so?
Again, if you can make even a hand drawn graph of what you'd like to build, it'd be extremely useful.

FYI, to paste an image it is sufficient to take the screenshot, copy it to your clipboard and simply CTRL+V it into the post while writing, without the need to redirect users to external pages with cookies and stuff :wink:

Another note (I'm editing this post while reading through all yours): if I got it right, what you call test_line is what is represented as equipment_code, and what you show in your screenshot is the result of the test on a single equipment_code, which is a FAIL even if only one of the subtests for that equipment_code is an M, right?

If that's the case, wouldn't it be more appropriate to store a document for each test, or at least for each equipment_code (aka test_line), with all the values of the subtests as part of it?
It does depend on what you're about to do with your data, but for the types of visualization you're asking about here, you need more aggregated data in order to avoid complex (and sometimes not feasible) computations.
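Just to make the idea concrete, the reshaped document could follow a mapping along these lines. This is only a sketch: the index name is made up, and whether you use nested subtests (as here) or flatten them out is up to you (keep in mind that Kibana's standard visualizations don't aggregate on nested fields, so flattening may actually be more convenient):

PUT tests-reshaped
{
  "mappings": {
    "properties": {
      "product_serial":  { "type": "keyword" },
      "product_variant": { "type": "keyword" },
      "equipment_code":  { "type": "keyword" },
      "global_status":   { "type": "keyword" },
      "tstamp":          { "type": "date" },
      "seq_duration":    { "type": "float" },
      "subtests": {
        "type": "nested",
        "properties": {
          "subtest_id":       { "type": "keyword" },
          "subtest_status":   { "type": "keyword" },
          "subtest_duration": { "type": "float" },
          "test_type":        { "type": "keyword" },
          "num_value_real":   { "type": "float" },
          "str_value_real":   { "type": "keyword" }
        }
      }
    }
  }
}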

Sorry about the data, here is an example:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "dpDR0W8B_GqiKl5gL8QY",
  "_score" : 1.0,
  "_source" : {
    "@timestamp" : "2020-01-23T09:52:24.856Z",
    "subtest_duration" : 336.931,
    "path" : "my-file.TXT",
    "subtest_name" : "BTN ADC Analog Open",
    "num_value_real" : 2.017,
    "test_type" : "Analogic",
    "metadata" : {
      "equipment_code" : "E0023",
      "global_status" : "B",
      "from_app_start" : 375771.044,
      "product_version" : "my-product-version",
      "seq_duration" : 159.953,
      "product_serial" : "4637054",
      "week" : 46,
      "product_variant" : "23783",
      "tstamp" : "2019-11-12T20:47:49.000Z"
    },
    "subtest_status" : "B",
    "subtest_id" : "P004401",
    "from_test_start" : 94.573,
    "test_name" : "ID23 : Buttons Interface"
  }
}

I will investigate the bulk API later, as I prefer to focus on providing the information (even if in a messier way :sweat_smile: )

You are perfectly right: only the global test is timestamped, the subtests are not (the order is not important at the moment, but it can be recovered from the subtest_id)

Thanks for the tip ! :slight_smile:

Totally correct

I see what you mean! I must confess I hadn't thought about it, but it would make a lot more sense.
I will try to modify my Logstash config file to follow your suggestion.
Just hoping I will be Logstash-proficient enough to do so :stuck_out_tongue:

I will work on it ASAP and report back to tell you if it helped. I will also attempt to draw what I intend as Kibana visuals

Hello,
I have been quite busy trying to figure out the best index mapping for my data, but I guess it is not entirely relevant to my issue. I have therefore simplified the indexed data; I'm pretty confident I will be able to adapt the views to a more complex index mapping later.

Anyway, here is a new data example:

{
        "_index" : "my_index-2020.01.30",
        "_type" : "_doc",
        "_id" : "Dju39W8Byf-Hbilm27G8",
        "_score" : 1.0,
        "_source" : {
          "product_serial" : "1234567",
          "global_status" : "B",
          "tstamp" : "2019-11-12T17:38:33.000Z",
          "product_version" : "my version",
          "equipment_code" : "E0021",
          "date" : "11-12-2019",
          "from_app_start" : 44413.381,
          "@version" : "1",
          "seq_duration" : 176.959,
          "@timestamp" : "2020-01-30T09:11:04.892Z",
          "product_variant" : "12345",
          "path" : "my-parsed-filepath.txt",
          "host" : "my-server",
          "week" : 46,
          "time" : "18:38:33",
          "year" : "2019"
        }
      }

The associated discover screenshot:

Finally, here are two graph examples of what I would like to get for my 2 use cases (see previous post):

  1. Use case 1:
     [bar chart sketch: products first tested each week, stacked by test line]

  2. Use case 2:
     [bar chart sketch: products first tested on a given line each week, stacked by pass / fail status]

Hello,

Me again. In order to describe more precisely what I don't manage to do, here is a data table:

You can see that I used a row split on the timestamp, reducing the results to only the first occurrence of each serial (since I am interested only in the first test for each product).
The total count of unique tests marked as correct (global_status : "B") is 1605 (out of 1666).

Now what I want to do is replicate this behavior with a vertical bar chart (use case 2). This is what I managed to get:

The problem here is that I get 1650 for the correct tests, obviously because I wasn't able to take into account only the first test for each product...

So here is my question: does anybody know if it is possible to get such a histogram with Kibana? And how?

Thanks in advance to anyone brave enough to delve into this :slight_smile:

Hi there,

sorry for disappearing but I've been quite busy. And I still am, but I'll read everything as soon as I can.

Just one question: why are you ingesting all the additional data if you're only interested in the first one? And sorry for the question (it's most likely written somewhere), but by "first" you mean the first of the week, right?

I mean, if on Monday of week1 you have a serial xyz, then you will ignore all the xyz products up to the Sunday of week1. But if product xyz appears in week2, it has to be taken into consideration (obviously the first occurrence in week2), right?

Hi Fabio,

No problem for the delay, it's already great to have someone to interact with :slight_smile:

Because it depends on the visualization: some (the most important, e.g. the first pass yield) focus on the first occurrence of the tests, others do not (so that I can drill down into any potential issue).

"First" means the first of all time, period. In fact, there cannot be 2 products with the same serial (serial=unique id). A given object has a unique serial and goes through a series of tests between one and 4 times (over a short period of time, let's say 1 day max) :

  • as soon as the overall result is good, the object exits the test line labelled as good to sell
  • if after 4 tests the overall result is still incorrect, the object is labelled as flawed

Ok so, what about "duplicating" the data over two indices, like "whatever-index" and "whatever-index-unique"? In the former you put all your data (useful for drilldown purposes), while in the latter you basically use the product_serial as the document_id and simply don't ingest a document if a doc with that same id is already in the index.

That way you could make an index pattern like whatever-index*, and in the visualizations that require documents to be unique you filter in the search bar by _index:whatever-index-unique.

Might that be an idea to simplify whatever you have to do?
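In raw ES terms, the "don't ingest if already there" part is just the create op type. A quick Dev Tools sketch, with values taken from your sample document (depending on your ES version the endpoint may also be written as PUT whatever-index-unique/_doc/4637054?op_type=create):

PUT whatever-index-unique/_create/4637054
{
  "product_serial" : "4637054",
  "equipment_code" : "E0023",
  "global_status" : "B",
  "tstamp" : "2019-11-12T20:47:49.000Z"
}

Running the same call again (i.e. indexing any later test with the same serial) returns a 409 version_conflict_engine_exception, so only the first document per serial actually ends up in that index.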

I guess this could work (even if I thought it would be feasible directly through Kibana manipulations)...
However, I'm not sure how to do it. I suppose this can be done through Logstash? Or do I simply have to modify the ES index?

This might be done directly in Kibana (even though I need time to figure out your situation and I simply don't have it right now), but if adding an additional index is not a problem for you, I'd definitely go with that solution. It's way lighter from a computational point of view, since you don't run any complex aggregation while browsing a dashboard, and not so heavy from a storage point of view, since you're building an index of unique entries.

For sure it also depends on your specific case. But it's generally easier to run out of computational power than raw storage. What's the amount of data in this case (how many product serials per day and how heavy is each doc)?
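Just to show what I mean by "complex aggregation": to get "products whose FIRST test was on line X in week Y" purely with aggregations, you'd need something along these lines (an untested sketch for a 7.x cluster; the .keyword suffixes assume the default dynamic mapping of your simplified docs):

GET whatever-index*/_search
{
  "size": 0,
  "aggs": {
    "per_week": {
      "date_histogram": { "field": "tstamp", "calendar_interval": "week" },
      "aggs": {
        "per_line": {
          "terms": { "field": "equipment_code.keyword" },
          "aggs": {
            "per_product": {
              "terms": { "field": "product_serial.keyword", "size": 10000 },
              "aggs": {
                "first_test_in_bucket": { "min": { "field": "tstamp" } }
              }
            }
          }
        }
      }
    }
  }
}

That creates weeks × lines × serials buckets, and even then the per-serial buckets are scoped inside each week/line bucket, so it still doesn't cleanly isolate the overall first test of each product. That's why a dedicated unique index is so much cheaper.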

Anyway, how are you ingesting data right now? Directly into ES? From where?

For the moment, storage is not a problem since it is a proof of concept!
However, I will ask my colleagues for the daily volume of raw data, just to be sure...

Currently, I am using Logstash to parse my log files (directly from a local directory, since it is a POC...)

Ok then, ask your colleagues, but generally storage is easier to ask for than computational power. Anyway, when I have time I'll try to think of a good solution to achieve it in Kibana from a single index.

As for the ingestion, if you're already using logstash you can do something like this in the output section:

output {
  # every event goes into the main index with an auto-generated id
  elasticsearch {
    hosts => "localhost:9200"
    index => "whatever-index"
  }

  # events carrying a product_serial also go into the "unique" index,
  # using the serial as document id; "create" means later duplicates are skipped
  if [product_serial] {
    elasticsearch {
      hosts => "localhost:9200"
      index => "whatever-index-unique"
      document_id => "%{product_serial}"
      action => "create"
    }
  }
}

This way you'll insert the event into whatever-index with a random id and, if the event has the product_serial field, the same event into whatever-index-unique with the product_serial as id (so it won't be written if it's already there).

OK, I tried it and it works! :star_struck:
I only had to be careful about the order of the input files, to make sure the oldest ones are handled first.
Since the filenames are labelled with a timestamp, I just had to use file_sort_by => "path" in the file input.
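For anyone finding this later, the input section now looks roughly like this (the path and the sincedb setting are placeholders for my POC setup):

input {
  file {
    path => "/data/test-logs/*.TXT"   # placeholder: local directory with the raw log files
    mode => "read"                    # read whole files instead of tailing them
    file_sort_by => "path"            # filenames carry a timestamp, so path order = chronological order
    sincedb_path => "/dev/null"       # placeholder: re-read everything on each run (POC only)
  }
}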
Thank you very much for your help, Fabio!
Should I tag your previous answer as a solution, even if it is done outside of Kibana?

Glad it works :slight_smile:

Anyway, for now you can tag this one as a solution

Ok so, what about "duplicating" the data over two indices, like "whatever-index" and "whatever-index-unique"? In the former you put all your data (useful for drilldown purposes), while in the latter you basically use the product_serial as the document_id and simply don't ingest a document if a doc with that same id is already in the index.

That way you could make an index pattern like whatever-index*, and in the visualizations that require documents to be unique you filter in the search bar by _index:whatever-index-unique.

Might that be an idea to simplify whatever you have to do?

If I manage to think of something we might update the post later.

Done!
Thanks again :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.