Elasticsearch get all unique Ids

Thanks in advance to all who read this.

Background: I have a job that inserts a document when it executes. The job can fail and run again, inserting a second document. Both documents have the same linkId, so I know there was only one job. I need the count of ALL unique linkIds, no matter how many there are.

I first looked at the cardinality aggregation, but that has a limit of 40,000 (precision_threshold), beyond which it becomes inaccurate.
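For reference, the threshold can be raised per request up to its 40,000 maximum:

```json
"aggs": {
  "distinct_attempts": {
    "cardinality": {
      "field": "linkId",
      "precision_threshold": 40000
    }
  }
}
```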

Then I looked into cumulative_cardinality, but that strategy builds on cardinality. The documentation makes no mention of inaccuracy, though. Is this method accurate for large numbers?

My aggs section looks like this:

  "aggs": {
    "unique_attempts": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      "aggs": {
        "distinct_attempts": {
          "cardinality": {
            "field": "linkId"
        "total_attempts": {
          "cumulative_cardinality": {
            "buckets_path": "distinct_attempts"

So is this an approach that will give me an accurate count of unique linkIds? Or do I make multiple calls using the API and join the results myself? If API, what strategy would be best?
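For the multiple-calls route, one option is a composite aggregation, whose buckets are exact terms and can be paged with `after_key`. Below is a sketch (not run against a live cluster; `count_unique_link_ids` and the `search` wrapper are illustrative names, not part of any client API):

```python
# Sketch: exact unique-linkId count via composite aggregation paging.
# `search` is any callable that takes a request body and returns the
# parsed JSON response (e.g. a thin wrapper around the official
# client's es.search(index=..., body=...)).

def count_unique_link_ids(search, page_size=1000):
    """Page through every distinct linkId with a composite aggregation.

    Unlike `cardinality`, composite buckets are exact: each page lists
    real terms, and `after_key` resumes where the last page stopped.
    """
    total = 0
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"linkId": {"terms": {"field": "linkId"}}}],
        }
        if after_key is not None:
            composite["after"] = after_key
        body = {"size": 0, "aggs": {"ids": {"composite": composite}}}
        agg = search(body)["aggregations"]["ids"]
        total += len(agg["buckets"])
        after_key = agg.get("after_key")
        if after_key is None or not agg["buckets"]:
            break
    return total
```

The trade-off versus cardinality is one request per page of distinct terms, so this costs more round trips but the count is exact.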

Many thanks! This is not an easy question.

Have you considered using a transform to create a separate index with a single document per linkId? This would allow you to query, filter, and count the number of documents, which would give exact values. The trade-off is that it may lag a little.



@Christian_Dahlqvist provided excellent advice.

The latest transform could be a great solution for this... an added benefit is that the details of the last document are also saved.
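A minimal sketch of such a transform (the transform id and the index names attempts / attempts_by_link_id are placeholders; substitute your own):

```json
PUT _transform/unique_link_ids
{
  "source": { "index": "attempts" },
  "dest": { "index": "attempts_by_link_id" },
  "latest": {
    "unique_key": ["linkId"],
    "sort": "timestamp"
  }
}
```

Once it has run, a simple count query against the destination index gives an exact number of unique linkIds, one document per id.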

No, first I've heard of it. I'll take a look. Thanks!

It's Elasticsearch, not ElasticSearch, by the way 🙂


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.