Kibana alternate way to remove duplicates or get precise Unique Count

Hi All,

I have 60,000 documents in an index. Many of these documents have the same value for the field "BookId".

NOTE: There are 14,000 unique BookIds. The duplicates exist because the same BookId appears in several documents that differ in other fields, which brings the index to 60,000 total hits.

I am creating a bar chart visualization with "Category" on the X-axis and a Unique Count of "BookId" as the Y-axis metric, but it produces the wrong unique count. From some searching, I learned that the count is an approximation and that setting the JSON input to {"precision_threshold" : 40000} should solve it. But it is still missing thousands of values.

If it is an approximate value, how can I get an exact unique count (or remove the duplicates) in my bar chart?

Also, can I filter out the duplicates in the Discover tab so that it shows the right count in hits?

Above 40K, the results are still fuzzy, unfortunately, as documented here. Since you have 60K records, I think you're still going to see fuzzy cardinality results.
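For reference, here is a rough sketch of the kind of request body that a "Unique Count per Category" chart boils down to: a terms aggregation with a cardinality sub-aggregation. The function name, the aggregation names (per_bucket, unique_count) and the bucket size of 50 are placeholders I made up for illustration; only "Category", "BookId" and precision_threshold come from the question. Elasticsearch treats 40000 as the maximum precision_threshold, so the sketch clamps to that.

```javascript
// Hypothetical helper: builds a search body that counts distinct values of
// `uniqueField` inside each bucket of `bucketField`.
function buildUniqueCountBody(bucketField, uniqueField, precisionThreshold) {
  return {
    size: 0, // only the aggregation results are needed, not the documents
    aggs: {
      per_bucket: {
        terms: { field: bucketField, size: 50 }, // 50 buckets is an arbitrary choice
        aggs: {
          unique_count: {
            cardinality: {
              field: uniqueField,
              // Values above 40000 are treated as 40000 by Elasticsearch
              precision_threshold: Math.min(precisionThreshold, 40000),
            },
          },
        },
      },
    },
  };
}

console.log(JSON.stringify(buildUniqueCountBody('Category', 'BookId', 40000), null, 2));
```

Even at the maximum threshold the cardinality aggregation remains an estimate, which is why the rest of this reply looks at counting on the client instead.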

We have a client-side scripting language (Kibana expressions), which you could probably put to use here, although it's not quite ready to go for the bar chart.

If you really want precision, you may need to write a plugin that tallies things. Here's an example of getting a distinct count of "category" per "city" from a "pets" index, using JavaScript. You can test this locally by running Kibana like this: yarn start --repl. After Kibana boots, you can paste this code into the REPL (in your terminal), and then enter clientDistinct(kbnServer), passing the kbnServer object that the REPL should expose, and you should see an accurate distinct count. You'll want to modify the query to actually select the fields / index you want.

async function clientDistinct(kbnServer) {
  const callCluster = kbnServer.server.plugins.elasticsearch.getCluster('admin').callWithInternalUser;
  const result = {};
  let from = 0;

  while (true) {
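    // callCluster is a function which calls Elasticsearch, and may
    // not be exactly what you'd use. Note that from-based paging is
    // capped by index.max_result_window (10,000 by default), so a
    // larger index needs the scroll API or search_after instead.
    const { hits } = await callCluster('search', {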
      index: 'pets',
      body: {
        "from" : from,
        "_source" : {
          "includes" : [
            "category",
            "city"
          ],
          "excludes" : [ ]
        },
        "sort" : [
          {
            "_doc" : {
              "order" : "asc"
            }
          }
        ]
      }
    });

    if (!hits || !hits.hits.length) {
      break;
    }

    from += hits.hits.length;

    // This does a distinct count of categories grouped by city
    hits.hits.forEach(({ _source }) => {
      const set = result[_source.city] || new Set();
      result[_source.city] = set;
      set.add(_source.category);
    });
  }

  // Returns something like: { newyork: 3, seattle: 55 }
  return Object.keys(result).reduce((acc, k) => {
    acc[k] = result[k].size;
    return acc;
  }, {});
}
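Since the index in the question holds 60,000 documents, here is a hedged variant of the same tally that pages with the scroll API. It is a sketch, not tested against a real cluster: it assumes callCluster(method, params) behaves like the legacy elasticsearch.js client (a 'search' call that accepts scroll, and a 'scroll' call that accepts scrollId), and the names addHits and distinctCountByScroll, the page size of 1000 and the '1m' keep-alive are all my own placeholders.

```javascript
// Adds one page of hits to `sets`, a map of group value -> Set of seen values.
function addHits(sets, hits, groupField, valueField) {
  for (const { _source } of hits) {
    const group = _source[groupField];
    if (!sets[group]) {
      sets[group] = new Set();
    }
    sets[group].add(_source[valueField]);
  }
  return sets;
}

// Assumption: callCluster follows the legacy elasticsearch.js client API.
async function distinctCountByScroll(callCluster, index, groupField, valueField) {
  const sets = {};

  // Open a scroll context and fetch the first page.
  let response = await callCluster('search', {
    index,
    scroll: '1m',
    size: 1000,
    body: {
      _source: [groupField, valueField],
      sort: ['_doc'],
    },
  });

  // Keep pulling pages until one comes back empty.
  while (response.hits && response.hits.hits.length) {
    addHits(sets, response.hits.hits, groupField, valueField);
    response = await callCluster('scroll', {
      scrollId: response._scroll_id,
      scroll: '1m',
    });
  }

  // Collapse each Set to its size, e.g. { fiction: 2, science: 1 }
  const counts = {};
  for (const group of Object.keys(sets)) {
    counts[group] = sets[group].size;
  }
  return counts;
}
```

In the REPL you would call it with the callCluster function from the snippet above, for example distinctCountByScroll(callCluster, 'your-index', 'Category', 'BookId'), where 'your-index' stands in for the real index name. Every distinct BookId is held in memory, which should be fine for 14,000 values but would not scale indefinitely.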

I should note that this is fairly trivial to do in Canvas:

essql query="SELECT city, category FROM pets"
| ply by=city fn={math "unique(category)"}

You'll need to modify that to query your index, and change by=city and unique(category) to whatever columns you're working with.
