“Distinct” query in the Elasticsearch Java API

How do I create a “distinct” query with the Elasticsearch Java API, like we do in SQL? Here is the query that Kibana generated. I need to do the same in Java. I have tried many approaches, but I keep getting duplicates.

"aggs": {
    "3": {
      "terms": {
        "field": "policy.weight",
        "order": {
          "1": "asc"
        },
        "size": 5
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "item.item_doc_id.keyword"
          }
        },
        "2": {
          "terms": {
            "field": "policy.name.keyword",
            "order": {
              "1": "desc"
            },
            "size": 100
          },
          "aggs": {
            "1": {
              "cardinality": {
                "field": "item.item_doc_id.keyword"
              }
            },
            "4": {
              "terms": {
                "field": "policy.description.keyword",
                "order": {
                  "1": "desc"
                },
                "size": 100
              },
              "aggs": {
                "1": {
                  "cardinality": {
                    "field": "item.item_doc_id.keyword"
                  }
                }
              }
            }
          }
        }
      }
    }
  }

And this is the Java code I'm trying, but it returns duplicates.

        final SearchRequest searchRequest = new SearchRequest(bucketListInfo.getIndexName());
        final String field1 = "policy.name.keyword";
        final String field2 = "policy.description.keyword";
//        final String field3 = "policy.weight";
        final String field3 = "item.item_doc_id.keyword";
        final SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
                .query(getRangeQueryBuilderWithOptionalFilter(bucketListInfo.getTimestampFieldName(), bucketListInfo.getFilterFieldName(), bucketListInfo.getFilterFieldValue(),
                        bucketListInfo.getFrom(), bucketListInfo.getTo())).size(100);

        TermsAggregationBuilder termsAggregationBuilder;
        AggregationBuilder cardinalityAggregationBuilder = AggregationBuilders.cardinality(field3).field(field3);
        termsAggregationBuilder = AggregationBuilders.terms(field1).field(field1).size(100);
        searchSourceBuilder.aggregation(termsAggregationBuilder);

        termsAggregationBuilder = AggregationBuilders.terms(field2).field(field2).size(100);
        searchSourceBuilder.aggregation(termsAggregationBuilder);

        searchSourceBuilder.aggregation(cardinalityAggregationBuilder);
        searchRequest.source(searchSourceBuilder);
        try {
            final SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
            BucketList bucketList = new BucketList();
            final Terms terms1 = response.getAggregations().get(field1);
            final Terms terms2 = response.getAggregations().get(field2);
            Cardinality terms3 = response.getAggregations().get(field3);
            bucketList.getBuckets().add(terms1.getBuckets());
            bucketList.getBuckets().add(terms2.getBuckets());
//            bucketList.getBuckets().add(terms3.getBuckets());
            return bucketList;
        } catch (Exception e) {
            log.error(e.getMessage(), e);
            return null;
        }

So please tell me: what is the correct way to write that Kibana-generated query in Java?

Hi @hmkhitaryan

Here is code adapted from your query.

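    // Imports use the 7.x package names; they may differ in other client versions.
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.aggregations.BucketOrder;
    import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
    import org.elasticsearch.search.aggregations.metrics.CardinalityAggregationBuilder;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    SearchRequest searchRequest = new SearchRequest("idx_name");

    // Cardinality of the doc id, attached as sub-aggregation "1" at every level.
    CardinalityAggregationBuilder aggsCardinalityDocId = AggregationBuilders.cardinality("1").field("item.item_doc_id.keyword");

    // Level "3": terms on policy.weight, ordered by cardinality "1" ascending.
    TermsAggregationBuilder aggsTermPolicyWeight = AggregationBuilders.terms("3")
        .field("policy.weight").size(5)
        .order(BucketOrder.aggregation("1", true));
    aggsTermPolicyWeight.subAggregation(aggsCardinalityDocId);

    // Level "2": terms on policy name, ordered by cardinality "1" descending.
    TermsAggregationBuilder aggsTermPolicyName = AggregationBuilders.terms("2")
        .field("policy.name.keyword").size(100)
        .order(BucketOrder.aggregation("1", false));

    // Level "4": terms on policy description, ordered by cardinality "1" descending.
    TermsAggregationBuilder aggsTermPolicyDescription = AggregationBuilders.terms("4")
        .field("policy.description.keyword").size(100)
        .order(BucketOrder.aggregation("1", false));
    aggsTermPolicyDescription.subAggregation(aggsCardinalityDocId);

    aggsTermPolicyName.subAggregation(aggsCardinalityDocId);
    aggsTermPolicyName.subAggregation(aggsTermPolicyDescription);

    // Nest "2" (with "4" inside it) under "3", mirroring the Kibana query.
    aggsTermPolicyWeight.subAggregation(aggsTermPolicyName);

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.aggregation(aggsTermPolicyWeight);

    searchRequest.source(searchSourceBuilder);

    SearchResponse searchResponse = getClient().search(searchRequest, RequestOptions.DEFAULT);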

Hi @RabBit_BR, I'm trying this, but it again returns hits with duplicates.
Actually, the final response should look like this:

   [
     {
       "policyName" : "policy name from db", 
       "policyDescription" : "policy desc from db",
       "count" :  the count from db
     },
     ....
   ]
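To make the target concrete, here is a minimal plain-Java sketch (no Elasticsearch client involved) of the result described above: rows are grouped by policy name and description, and `count` is the number of distinct doc ids in each group. The `Row` class, the sample values, and the method name are hypothetical. An in-memory count like this is exact, whereas a cardinality aggregation in Elasticsearch is approximate.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DistinctCountSketch {

    /** Hypothetical flattened view of one indexed document. */
    static final class Row {
        final String policyName;
        final String policyDescription;
        final String docId;

        Row(String policyName, String policyDescription, String docId) {
            this.policyName = policyName;
            this.policyDescription = policyDescription;
            this.docId = docId;
        }
    }

    /** Groups rows by (name, description) and counts distinct doc ids per group. */
    static List<Map<String, Object>> countDistinctDocIds(List<Row> rows) {
        // Key: [policyName, policyDescription]; value: set of doc ids seen for that key.
        Map<List<String>, Set<String>> idsPerGroup = new LinkedHashMap<>();
        for (Row row : rows) {
            List<String> key = List.of(row.policyName, row.policyDescription);
            idsPerGroup.computeIfAbsent(key, k -> new HashSet<>()).add(row.docId);
        }

        // Shape each group like the desired response item.
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map.Entry<List<String>, Set<String>> entry : idsPerGroup.entrySet()) {
            Map<String, Object> item = new LinkedHashMap<>();
            item.put("policyName", entry.getKey().get(0));
            item.put("policyDescription", entry.getKey().get(1));
            item.put("count", entry.getValue().size());
            result.add(item);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("p1", "first policy", "doc-1"),
            new Row("p1", "first policy", "doc-1"), // repeated doc id, counted once
            new Row("p1", "first policy", "doc-2"),
            new Row("p2", "second policy", "doc-1"));
        System.out.println(countDistinctDocIds(rows));
    }
}
```

In aggregation terms this corresponds to a terms bucket on the name, a nested terms bucket on the description, and a cardinality metric on the doc id inside it, which is the nesting the Kibana-generated query already uses.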

The problem must be in the query itself. The code is an exact translation of the query that you posted.

What do you mean by duplicates? Be aware that cardinality aggregations are approximations and not necessarily exact.

@Christian_Dahlqvist coming back to your question: the whole query retrieves rows where doc_id is not unique. I want it to be unique, like distinct, so that I get only the rows where the doc_id field does not repeat.

Actually, these two methods do what I need. The only shortcoming is that I get rows where the doc_id field values repeat. So the problem now is how to make this field's values distinct.

    public BucketList getListOfBucketsTimeRestricted(BucketListInfo bucketListInfo) {
        final SearchRequest searchRequest = new SearchRequest(bucketListInfo.getIndexName());

        final SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
                .query(getRangeQueryBuilderWithOptionalFilter(bucketListInfo.getTimestampFieldName(), bucketListInfo.getFilterFieldName(), bucketListInfo.getFilterFieldValue(),
                        bucketListInfo.getFrom(), bucketListInfo.getTo())).size(100);

        for (String aggrField : bucketListInfo.getAggrFieldList()) {
            final TermsAggregationBuilder aggregationBuilder = AggregationBuilders.terms(aggrField).field(aggrField);
            searchSourceBuilder.aggregation(aggregationBuilder);
        }
        searchRequest.source(searchSourceBuilder);
        try {
            final SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
            BucketList bucketList = new BucketList();
            for (String aggrField : bucketListInfo.getAggrFieldList()) {
                final Terms terms = response.getAggregations().get(aggrField);
                bucketList.getBuckets().add(terms.getBuckets());
            }
            return bucketList;
        } catch (Exception e) {
            log.error(e.getMessage(), e);
            return null;
        }
    }

    private static final List<String> AGGR_FIELDS = List.of("policy.name.keyword", "policy.description.keyword");
    private static final String IS_COMPLIANT_FIELD = "is_compliant";

    private final SearchClient searchClient;

    public List<PolicyViolation> getEvaluationTimeRestricted(DateTime from, DateTime to) {
        final BucketListInfo bucketListInfo = BucketListInfo.builder()
                .indexName(EVALUATION.getIndexName())
                .timestampFieldName(EVALUATION.getTimestampFieldName())
                .aggrFieldList(AGGR_FIELDS)
                .filterFieldName(IS_COMPLIANT_FIELD)
                .filterFieldValue(false)
                .from(from)
                .to(to)
                .build();

        BucketList buckets = searchClient.getListOfBucketsTimeRestricted(bucketListInfo);
        if (Objects.isNull(buckets)) {
            return Collections.emptyList();
        }

        List<? extends Terms.Bucket> firstBuckets = buckets.getBuckets().get(0);
        List<? extends Terms.Bucket> secondBuckets = buckets.getBuckets().get(1);
        final List<PolicyViolation> evaluationCounts = new ArrayList<>();

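        // Caution: the two bucket lists come from independent terms aggregations,
        // so index i is not guaranteed to refer to the same policy in both lists.
        for (int i = 0; i < firstBuckets.size(); i++) {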
            evaluationCounts.add(
                    new PolicyViolation(firstBuckets.get(i).getKeyAsString(), secondBuckets.get(i).getKeyAsString(), firstBuckets.get(i).getDocCount()));
        }
        return evaluationCounts;
    }

If you have not set size to 0, the request also returns the documents that match the filter. Those hits are not affected by the aggregations, so they will not be unique.

@Christian_Dahlqvist will you please show me where to set that size?

Please have a look at the docs.
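As a rough illustration of where that size lives: in the request body, the top-level `size` controls how many hits come back, while the `size` inside a `terms` aggregation only limits the number of buckets. With the `SearchSourceBuilder` used in the code above, the counterpart of the top-level setting would be `.size(0)` on the builder instead of `.size(100)`. The sketch below only builds the JSON body as a string. The aggregation names `by_policy_name` and `distinct_docs` are made up for the example, and the field names are taken from the original query.

```java
public class SizeZeroBodySketch {

    /**
     * Builds a search body in which the top-level "size" controls the returned hits
     * and the "size" inside "terms" controls the number of buckets.
     */
    static String buildBody(int hits, int buckets) {
        return "{"
            + "\"size\": " + hits + ","
            + "\"aggs\": {"
            +   "\"by_policy_name\": {"  // hypothetical aggregation name
            +     "\"terms\": {\"field\": \"policy.name.keyword\", \"size\": " + buckets + "},"
            +     "\"aggs\": {"
            +       "\"distinct_docs\": {"  // hypothetical aggregation name
            +         "\"cardinality\": {\"field\": \"item.item_doc_id.keyword\"}"
            +       "}"
            +     "}"
            +   "}"
            + "}"
            + "}";
    }

    public static void main(String[] args) {
        // 0 hits, up to 100 buckets: only the aggregation results are returned.
        System.out.println(buildBody(0, 100));
    }
}
```

With zero hits requested, the response carries only the aggregation buckets, so there are no raw documents left to look like duplicates.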

@Christian_Dahlqvist sorry, but I think we don't understand each other.
I want to do the equivalent of a "select distinct" query in the Elasticsearch Java API, and then group by the results of that select distinct query. Just that.

Like this:
select distinct doc_id, ..., ... from ... where ... group by ...

As far as I know, that is not possible in Elasticsearch.

no way ? :pensive:

Here is a thought...

Use the SQL API interface

Then use the translate API to get the DSL

Perhaps that will help. One word of caution: the SQL API is very picky about single quotes, double quotes, and so on.

@stephenb I started using the SQL search API you suggested, but the problem now is that my index name is "*-evaluation", and I'm using a query like this:

        String query = "{\"query\":\"SELECT 'name' FROM *-evaluation\"" + ","
                + "\"filter\": {"
                   + " \"range\": {"
                      + " \"@cspm.ingested_at\": {"
                         + " \"from\": \"15/11/2022\","
                         + " \"to\": \"16/11/2022\","
                         + " \"format\": \"dd/MM/yyyy\""
                     + " }"
                + " }"
                + "}}";

I get a very unclear JSON object graph, which I think is unrelated to the data I am expecting. I even escaped the index name in the string, like '*-evaluation', but I still do not get the expected results. Any idea how to resolve this?
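One thing that may be worth checking, given the caution about quoting: in Elasticsearch SQL, single quotes denote string literals, so `SELECT 'name'` returns the literal text rather than a column, and an index pattern containing a wildcard is written in double quotes. The sketch below builds such a body under those assumptions. The selected columns, the alias `doc_count`, and the use of `COUNT(DISTINCT ...)` with `GROUP BY` are guesses at the intended query, not something confirmed in this thread. The range filter uses `gte`/`lte` bounds, and the inputs are assumed to be trusted values that need no escaping.

```java
public class SqlBodySketch {

    /** Builds a JSON body for the SQL search API with a double-quoted index pattern. */
    static String buildSqlBody(String indexPattern, String fromDate, String toDate) {
        // \\\" in Java source becomes \" in the JSON text, i.e. a plain " inside the SQL.
        String sql = "SELECT policy.name.keyword, COUNT(DISTINCT item.item_doc_id.keyword) AS doc_count"
            + " FROM \\\"" + indexPattern + "\\\""
            + " GROUP BY policy.name.keyword";

        return "{"
            + "\"query\": \"" + sql + "\","
            + "\"filter\": {"
            +   "\"range\": {"
            +     "\"@cspm.ingested_at\": {"
            +       "\"gte\": \"" + fromDate + "\","
            +       "\"lte\": \"" + toDate + "\","
            +       "\"format\": \"dd/MM/yyyy\""
            +     "}"
            +   "}"
            + "}"
            + "}";
    }

    public static void main(String[] args) {
        System.out.println(buildSqlBody("*-evaluation", "15/11/2022", "16/11/2022"));
    }
}
```

Sending the same body to the translate API should show the aggregation DSL that the SQL engine generates, which can then be rebuilt with the Java aggregation builders.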

I want to do the equivalent of a "select distinct" query in the Elasticsearch Java API, and then group by the results of that select distinct query. Just that.

I do not think there is any query construct that, based on a filter, returns all unique values (terms) from a field. The closest you can get, I think, is a terms aggregation, but that gives you a count together with each term. There is also a limit on the size of the result set, so you may not get every value if there are many.

You can do nested aggregation, but I'm not sure if that's what you are looking for.
You can group by unique doc_id and within each unique doc_id, you can further group by other fields.
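The grouping described here can be pictured with a small plain-Java sketch: the outer map plays the role of a terms aggregation on doc_id, and the inner map plays the role of a sub-aggregation on another field, with a document count per inner bucket. The row layout (`{docId, policyName}`) and the sample values are hypothetical and only stand in for indexed documents.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NestedGroupingSketch {

    /** Groups rows of {docId, policyName} by doc id, then by policy name, counting rows. */
    static Map<String, Map<String, Long>> groupByDocIdThenPolicy(List<String[]> rows) {
        Map<String, Map<String, Long>> byDocId = new TreeMap<>();
        for (String[] row : rows) {
            String docId = row[0];
            String policyName = row[1];
            // Outer bucket per unique doc id, inner bucket per policy name within it.
            byDocId.computeIfAbsent(docId, k -> new TreeMap<>())
                   .merge(policyName, 1L, Long::sum);
        }
        return byDocId;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"doc-1", "p1"},
            new String[] {"doc-1", "p1"},
            new String[] {"doc-1", "p2"},
            new String[] {"doc-2", "p1"});
        System.out.println(groupByDocIdThenPolicy(rows));
    }
}
```

Each doc id appears exactly once as an outer key, which is the "distinct" part. Whether this nesting order or the reverse (policy first, doc id count inside) is the right one depends on which field the final rows should be keyed by.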