How to show the correct count in table visualization

chun · August 23, 2024, 9:12am

I have a index file with single document ID, 2 items under the single document:

"hits": [
      {
        "_index": "dremio_sys_jobs",
        "_id": "sys_jobs",
        "_score": 1,
        "_ignored": [
          "dremiojob_result.query.keyword"
        ],
        "_source": {
          "@timestamp": "2024-08-23T08:45:27.197280500Z",
          "dremiojob_result": [
            {
              "user_name": "1666678",
              "query_type": "FLIGHT",
              "queue_name": "",
              "submitted_ts": "2024-08-23 08:44:50.980",
              "job_id": "1937b3fc-a6d1-50a0-7c0e-5b70c79e0a00",
              "status": "METADATA_RETRIEVAL",
            },
            {
              "user_name": "1550279",
              "query_type": "REST",
              "queue_name": "Low Cost User Queries",
              "submitted_ts": "2024-08-23 08:45:29.096",
              "job_id": "1937b3d6-45ad-c697-4b27-e8a77c4c7700",
              "status": "RUNNING",
            }
          ]
        }
      }
	]

In the table visualization I put in "user_name" as row, and count of Job_id as matrics, I'm expecting
User Name No. of Jobs
1550279 1
1666678 1
But the visualization is showing 2 for each user. How to correct the count?

Tortoise · August 23, 2024, 9:34am

Hello,
Could you please share the API : POST {Data} by which index was created in order to create the same & try to troubleshoot further.
Thanks

chun · August 26, 2024, 2:27am

Here is curl command to create test data:

curl -X POST http://127.0.0.1:9200/_bulk?pretty \
  -H "Content-Type: application/json" \
  -d'
{ "create" : { "_index" : "my_index", "_id" : "1"} }
{ "data": [{"job_id" : "AAA" , "user_name" : "AAAAAAAAAAAAAA"}, {"job_id" : "BBB" , "user_name" : "BBBBBBBBBBBBBBB"}] }
'

dadoonet · August 26, 2024, 4:14am

What is the mapping?

IMO you should index each dremiojob_result as a single document to achieve what you are looking for.

So 2 documents instead of 1.

My 2 cents

chun · August 26, 2024, 6:55am

The reason I put all the results into single document, is to overwrite the document every X mins. so that the visualization can monitor the running jobs in "realtime". If each result is one document, how to refresh the document every X mins?

dadoonet · August 26, 2024, 8:36am

I'm not sure. You can filter the dataset by submitted_ts to only look at the last x minutes? I don't know the exact use case so I can't comment further.

But it seems that you have timeseries data here, which is a good fit for time based indices and data streams.

But may be explain a bit more what you would like to do.

chun · August 26, 2024, 9:10am

The data source is from Dremio sys_jobs table, this table only keeps the running jobs, once the job is completed, it will be removed from this table automatically by Dremio.

The dashboard built on ELK is to monitor the running jobs such as no. of running jobs, jobs by users etc.

For instance, job AAA is retrieved from the dataset the first time as one running job, the job count is 1. If this job is a long running job, the job count is 1 too for the second query. But if this job is completed before the second query, then the job count is 0.

How to achieve this by filter submitted_ts?

dadoonet · August 26, 2024, 10:04am

So I'd probably do something like this:

PUT /dremio-2024-08-23/_doc/1937b3fc-a6d1-50a0-7c0e-5b70c79e0a00
{
  "user_name": "1666678",
  "query_type": "FLIGHT",
  "queue_name": "",
  "submitted_ts": "2024-08-23 08:44:50.980",
  "status": "METADATA_RETRIEVAL",
}

and then later:

PUT /dremio-2024-08-23/_doc/1937b3fc-a6d1-50a0-7c0e-5b70c79e0a00
{
  "user_name": "1666678",
  "query_type": "FLIGHT",
  "queue_name": "",
  "submitted_ts": "2024-08-23 08:45:25.980",
  "status": "RUNNING",
}

So each job will be replaced by its latest state.

chun · August 26, 2024, 11:03am

I'm lost why two steps. The example given are two running jobs, one for each user. They are fetched at the same time from dremio sys_jobs, the total job count is 2, not 1.

on the ELK side, what jobs are running on Dremio is not known till logstash query the dataset sys_jobs via Dremio API.

That's why I use the single document in the index, and always refresh the document by X mins interval, to make sure Dremio dataset is mirrored in ELK index file. Is there any way Kibana can correctly display the metrics such as no. of running jobs owned by user?

To me Solving the Kibana issue looks much easier than creating an index with different content.

dadoonet · August 26, 2024, 11:34am

An array of documents is flattened when indexed into Elasticsearch.

So:

{
          "dremiojob_result": [
            {
              "user_name": "1666678",
              "query_type": "FLIGHT",
              "queue_name": "",
              "submitted_ts": "2024-08-23 08:44:50.980",
              "job_id": "1937b3fc-a6d1-50a0-7c0e-5b70c79e0a00",
              "status": "METADATA_RETRIEVAL",
            },
            {
              "user_name": "1550279",
              "query_type": "REST",
              "queue_name": "Low Cost User Queries",
              "submitted_ts": "2024-08-23 08:45:29.096",
              "job_id": "1937b3d6-45ad-c697-4b27-e8a77c4c7700",
              "status": "RUNNING",
            }
          ]
}

is indexed by default as:

{
  "dremiojob_result": {
     "user_name": ["1666678","1550279"],
     "query_type": ["FLIGHT","REST"],
     // Other fields skipped for clearness
     "status": ["METADATA_RETRIEVAL","RUNNING"]
  }
}

So there's no way to know which user correspond to which status...

If you want to keep inner documents as separated documents, then you need to use nested type for the dremiojob_result field. Using nested documents makes the requests more complex as you need to use nested queries and things like that. Which makes also the usage in Kibana more complex IMO.

That's why, unless there's a need for it, I'd have separated documents first and try to see if with that way, you are able to achieve what you are looking for.

Topic		Replies	Views
Create visualization with filter count Kibana	1	688	April 9, 2019
Visualization of errors aggregated by exception name and stack trace Kibana	3	1187	February 28, 2022
Is there any way to count two similar elements together in kibana visualization? Kibana	4	507	August 20, 2020
Table visualisation on last occurrence of specific field Kibana	1	424	June 16, 2020
Kibana not showing accurate count Kibana	7	1809	April 11, 2018

How to show the correct count in table visualization

Related topics