How to show the correct count in table visualization

I have a index file with single document ID, 2 items under the single document:

"hits": [
      {
        "_index": "dremio_sys_jobs",
        "_id": "sys_jobs",
        "_score": 1,
        "_ignored": [
          "dremiojob_result.query.keyword"
        ],
        "_source": {
          "@timestamp": "2024-08-23T08:45:27.197280500Z",
          "dremiojob_result": [
            {
              "user_name": "1666678",
              "query_type": "FLIGHT",
              "queue_name": "",
              "submitted_ts": "2024-08-23 08:44:50.980",
              "job_id": "1937b3fc-a6d1-50a0-7c0e-5b70c79e0a00",
              "status": "METADATA_RETRIEVAL",
            },
            {
              "user_name": "1550279",
              "query_type": "REST",
              "queue_name": "Low Cost User Queries",
              "submitted_ts": "2024-08-23 08:45:29.096",
              "job_id": "1937b3d6-45ad-c697-4b27-e8a77c4c7700",
              "status": "RUNNING",
            }
          ]
        }
      }
	]

In the table visualization I put in "user_name" as row, and count of Job_id as matrics, I'm expecting
User Name No. of Jobs
1550279 1
1666678 1
But the visualization is showing 2 for each user. How to correct the count?

Hello,
Could you please share the API : POST {Data} by which index was created in order to create the same & try to troubleshoot further.
Thanks

Here is curl command to create test data:

curl -X POST http://127.0.0.1:9200/_bulk?pretty \
  -H "Content-Type: application/json" \
  -d'
{ "create" : { "_index" : "my_index", "_id" : "1"} }
{ "data": [{"job_id" : "AAA" , "user_name" : "AAAAAAAAAAAAAA"}, {"job_id" : "BBB" , "user_name" : "BBBBBBBBBBBBBBB"}] }
'

What is the mapping?

IMO you should index each dremiojob_result as a single document to achieve what you are looking for.

So 2 documents instead of 1.

My 2 cents

The reason I put all the results into single document, is to overwrite the document every X mins. so that the visualization can monitor the running jobs in "realtime". If each result is one document, how to refresh the document every X mins?

I'm not sure. You can filter the dataset by submitted_ts to only look at the last x minutes? I don't know the exact use case so I can't comment further.

But it seems that you have timeseries data here, which is a good fit for time based indices and data streams.

But may be explain a bit more what you would like to do.

The data source is from Dremio sys_jobs table, this table only keeps the running jobs, once the job is completed, it will be removed from this table automatically by Dremio.

The dashboard built on ELK is to monitor the running jobs such as no. of running jobs, jobs by users etc.

For instance, job AAA is retrieved from the dataset the first time as one running job, the job count is 1. If this job is a long running job, the job count is 1 too for the second query. But if this job is completed before the second query, then the job count is 0.

How to achieve this by filter submitted_ts?

So I'd probably do something like this:

PUT /dremio-2024-08-23/_doc/1937b3fc-a6d1-50a0-7c0e-5b70c79e0a00
{
  "user_name": "1666678",
  "query_type": "FLIGHT",
  "queue_name": "",
  "submitted_ts": "2024-08-23 08:44:50.980",
  "status": "METADATA_RETRIEVAL",
}

and then later:

PUT /dremio-2024-08-23/_doc/1937b3fc-a6d1-50a0-7c0e-5b70c79e0a00
{
  "user_name": "1666678",
  "query_type": "FLIGHT",
  "queue_name": "",
  "submitted_ts": "2024-08-23 08:45:25.980",
  "status": "RUNNING",
}

So each job will be replaced by its latest state.

I'm lost why two steps. The example given are two running jobs, one for each user. They are fetched at the same time from dremio sys_jobs, the total job count is 2, not 1.

on the ELK side, what jobs are running on Dremio is not known till logstash query the dataset sys_jobs via Dremio API.

That's why I use the single document in the index, and always refresh the document by X mins interval, to make sure Dremio dataset is mirrored in ELK index file. Is there any way Kibana can correctly display the metrics such as no. of running jobs owned by user?

To me Solving the Kibana issue looks much easier than creating an index with different content.

An array of documents is flattened when indexed into Elasticsearch.

So:

{
          "dremiojob_result": [
            {
              "user_name": "1666678",
              "query_type": "FLIGHT",
              "queue_name": "",
              "submitted_ts": "2024-08-23 08:44:50.980",
              "job_id": "1937b3fc-a6d1-50a0-7c0e-5b70c79e0a00",
              "status": "METADATA_RETRIEVAL",
            },
            {
              "user_name": "1550279",
              "query_type": "REST",
              "queue_name": "Low Cost User Queries",
              "submitted_ts": "2024-08-23 08:45:29.096",
              "job_id": "1937b3d6-45ad-c697-4b27-e8a77c4c7700",
              "status": "RUNNING",
            }
          ]
}

is indexed by default as:

{
  "dremiojob_result": {
     "user_name": ["1666678","1550279"],
     "query_type": ["FLIGHT","REST"],
     // Other fields skipped for clearness
     "status": ["METADATA_RETRIEVAL","RUNNING"]
  }
}

So there's no way to know which user correspond to which status...

If you want to keep inner documents as separated documents, then you need to use nested type for the dremiojob_result field. Using nested documents makes the requests more complex as you need to use nested queries and things like that. Which makes also the usage in Kibana more complex IMO.

That's why, unless there's a need for it, I'd have separated documents first and try to see if with that way, you are able to achieve what you are looking for.