Sorting aggregation results with Java API

I have the query below, which buckets results and sorts them in the order I want, but I am unable to translate it to the Java API.

GET /my_data_set/_search
{
  "size": 0,
  "aggregations": {
    "toBeOrdered": {
      "terms": {
        "field": "Content.keyword",
        "size": 1000000,
        "order": [
          {
            "topSort": "asc"
          },
          {
            "nextSort": "asc"
          }
        ]
      },
      "aggregations": {
        "topAnswer": {
          "top_hits": {
            "size": 20,
            "from": 0,
            "sort": {
              "DateTime": "asc"
            },
            "_source": { "includes": ["LastName", "Content", "DateTime"]}
          }
        },
        "topSort": {
          "max": {
            "field": "LastName.raw"
          }
        },
        "nextSort": {
          "max": {
            "field": "Content.raw"
          }
        }
      }
    }
  }
}

An example document is

{
   "Content": ...,
   "LastName": ...,
   "DateTime": ...
}

Edit 1:

The basic idea of the GET request above is the following. Imagine looking for duplicate contents and taking the oldest one. After obtaining each oldest content value you want to sort by the last names of the authors.

Sample input documents.

{
    "Content": "My first tweet",
    "LastName": "Doe",
    "DateTime": "2020-01-29'T'00:00:01"
},
{
    "Content": "My first tweet",
    "LastName": "Smith",
    "DateTime": "2020-01-29'T'01:00:00"
},
{
    "Content": "Some other tweet",
    "LastName": "Locke",
    "DateTime": "2020-01-30'T'00:00:01"
}

I would want to receive a result of

{
    "Content": "My first tweet",
    "LastName": "Doe",
    "DateTime": "2020-01-29'T'00:00:01"
},
{
    "Content": "Some other tweet",
    "LastName": "Locke",
    "DateTime": "2020-01-30'T'00:00:01"
}

The tweets by Doe and Smith were bucketed together and Doe's tweet was kept because they have the same "Content" value but Doe posted first. In the result we have Doe's tweet before Locke's tweet because the last name Doe comes before Locke alphabetically.

How do I write this in Java code?

I haven't checked, but the following should translate to your requirement:

List<BucketOrder> bucketOrderList = new ArrayList<>();
bucketOrderList.add(BucketOrder.aggregation("topSort", true));
bucketOrderList.add(BucketOrder.aggregation("nextSort", true));
String[] topHitIncludes = {"LastName", "Content", "DateTime"};
TermsAggregationBuilder termsAggregation = AggregationBuilders.terms("Content.keyword")
    .size(1000000).order(bucketOrderList);
termsAggregation
    .subAggregation(AggregationBuilders.topHits("topAnswer").size(20).from(0)
        .sort("DateTime", org.elasticsearch.search.sort.SortOrder.ASC)
        .fetchSource(topHitIncludes, null))
    .subAggregation(AggregationBuilders.max("LastName.raw"))
    .subAggregation(AggregationBuilders.max("Content.raw"));

Links to documentation: TermsAggregationBuilder, BucketOrder

Thanks. I think the sorts "topSort" and "nextSort" are applied within the buckets, instead of across the buckets' representatives. Is there a way to get the buckets, flatten them and then sort, all inside of Elasticsearch?

I modified your code a little. I think in two places you put the field value where the aggregation name is expected. Here is what I am using.

    List<BucketOrder> bucketOrderList = query.getSorts().stream().map(sort ->
      BucketOrder.aggregation(sort.getField() + "-sort", sort.getDirection() == Sort.Direction.ASC)
    ).collect(Collectors.toList());

    String[] topHitIncludes = new String[query.getSelections().get(indexName).size()];
    query.getSelections().get(indexName).toArray(topHitIncludes);
    TermsAggregationBuilder termsAggregation = AggregationBuilders
            .terms("terms-aggregation")
            .field(query.getCollapser().getField() + ".keyword")
            .size(10)
            .order(bucketOrderList);

    termsAggregation
            .subAggregation(
                    AggregationBuilders
                            .topHits("topAnswer")
                            .size(1)
                            .from(0)
                            .sort("DateTime", org.elasticsearch.search.sort.SortOrder.ASC)
                            .fetchSource(topHitIncludes, null)
            );

    query.getSorts().forEach(sort -> {
      if (sort.getDirection() == Sort.Direction.ASC) {
        termsAggregation.subAggregation(
                AggregationBuilders.min(sort.getField() + "-sort").field(sort.getField() + ".raw")
        );
      } else {
        termsAggregation.subAggregation(
                AggregationBuilders.max(sort.getField() + "-sort").field(sort.getField() + ".raw")
        );
      }
    });

    searchBuilder.aggregation(termsAggregation);

And I get results that look like this, not ordered as expected.

{
  "result": [
    {
      "Content": "Stand-alone systemic toolset",
      "LastName": "35cabb8a-fc11-47c6-b1cf-7bb24781506c",
      "DateTime": "2015-07-30T23:30:09Z"
    },
    {
      "Content": "Programmable scalable capability",
      "LastName": "15ee0f86-3b5b-405c-8411-9b0c7439eafe",
      "DateTime": "2013-01-23T19:01:06Z"
    },
    {
      "Content": "Mandatory methodical knowledge base",
      "LastName": "9f3c52ef-6ff1-4f0c-8d6e-ff2ab386ff87",
      "DateTime": "1988-09-05T03:46:51Z"
    },
    {
      "Content": "Phased client-server paradigm",
      "LastName": "ab0c3680-a771-4c44-9c68-d6bd71080725",
      "DateTime": "1981-03-20T20:24:43Z"
    },
    {
      "Content": "Object-based fresh-thinking open architecture",
      "LastName": "215070ff-f637-45a3-ae4f-f32229ad7be9",
      "DateTime": "2009-07-20T18:42:54Z"
    },
    {
      "Content": "Inverse local groupware",
      "LastName": "cef240ed-bbeb-41f5-80ac-a72eb95e28b0",
      "DateTime": "1987-12-09T00:42:15Z"
    },
    {
      "Content": "Extended explicit project",
      "LastName": "0df4ef72-48e7-4b82-9d49-e19071059701",
      "DateTime": "1996-05-13T03:20:03Z"
    },
    {
      "Content": "Cross-group web-enabled capability",
      "LastName": "399acef6-1379-42f4-aadf-0c6f7122022b",
      "DateTime": "1986-05-08T23:57:34Z"
    },
    {
      "Content": "Quality-focused transitional portal",
      "LastName": "289cb6bb-bb54-4ec1-81b7-bcda7a0aa46f",
      "DateTime": "2013-03-04T10:08:21Z"
    }
  ],
  "totalCount": 2000
}

Thoughts? Would it be worth my time to collect the buckets in my service layer and sort the flattened result list there?
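For reference, the service-layer option could look roughly like the sketch below. It is plain Java with no Elasticsearch calls, and it assumes the hits have already been fetched and mapped into a simple object. The names ServiceLayerDedup, Tweet, and oldestPerContent are hypothetical, and the DateTime values are assumed to share one fixed ISO-8601-style format so that comparing them as text matches comparing them as times.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the class, Tweet type, and method name are illustrative only.
final class ServiceLayerDedup {

    /** Minimal stand-in for a hit, mirroring the fields of the sample documents. */
    static final class Tweet {
        final String content;
        final String lastName;
        final String dateTime; // assumed: one fixed ISO-8601-style format, so text order equals time order

        Tweet(String content, String lastName, String dateTime) {
            this.content = content;
            this.lastName = lastName;
            this.dateTime = dateTime;
        }
    }

    /** Keeps the oldest tweet per Content value, then sorts the survivors by LastName. */
    static List<Tweet> oldestPerContent(List<Tweet> hits) {
        Map<String, Tweet> oldest = new LinkedHashMap<>();
        for (Tweet t : hits) {
            Tweet current = oldest.get(t.content);
            // Replace the stored representative only when this tweet is strictly older.
            if (current == null || t.dateTime.compareTo(current.dateTime) < 0) {
                oldest.put(t.content, t);
            }
        }
        List<Tweet> result = new ArrayList<>(oldest.values());
        result.sort(Comparator.comparing((Tweet t) -> t.lastName)
                .thenComparing(t -> t.content)); // tie-break so the order is deterministic
        return result;
    }
}
```

With the three sample documents as input, this would return Doe's tweet followed by Locke's. The trade-off is that every matching document has to be pulled into the service first, which may be expensive for large result sets.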

It looks like the Java code you provided matches the GET low-level call I gave above. Unfortunately, I need something else -- it seems I gave you a bad request to start with. Sorry.

The final results are not sorted across buckets. What I need is to take one representative from each bucket (the oldest) and then sort those representatives by other fields.

I'm sorry, but I didn't quite understand your requirement. It would be easier for me if you provided the exact query that you are looking to translate into Java code.

Unfortunately, it turns out that I don't have a correct query for what I want to do, so I'll need your help constructing either the query or the Java code. All I have is the following requirement.

The idea of the request is the following. Imagine looking for duplicate contents and taking the oldest one. After obtaining each oldest content value you want to sort by the last names of the authors.

Sample input documents.

{
    "Content": "My first tweet",
    "LastName": "Doe",
    "DateTime": "2020-01-29'T'00:00:01"
},
{
    "Content": "My first tweet",
    "LastName": "Smith",
    "DateTime": "2020-01-29'T'01:00:00"
},
{
    "Content": "Some other tweet",
    "LastName": "Locke",
    "DateTime": "2020-01-30'T'00:00:01"
}

I would want to receive a result of

{
    "Content": "My first tweet",
    "LastName": "Doe",
    "DateTime": "2020-01-29'T'00:00:01"
},
{
    "Content": "Some other tweet",
    "LastName": "Locke",
    "DateTime": "2020-01-30'T'00:00:01"
}

The tweets by Doe and Smith were bucketed together and Doe's tweet was kept because they have the same "Content" value but Doe posted first. In the result we have Doe's tweet before Locke's tweet because the last name Doe comes before Locke alphabetically.

Any help you can offer would be greatly appreciated.

I think field collapsing is what you are looking for, not an aggregation.

I am going to work in Query DSL and not in the Java REST API.

Let's use a dataset from demo.elastic.co so we can share examples in Dev Tools.

GET twitter-sentiment-2020.03/_search
{
  "query": {
    "match": {
      "extended_tweet.full_text": "Elasticsearch"
    }
  },
  "collapse": {
    "field": "extended_tweet.full_text.keyword"
  },
  "sort": ["@timestamp", "user.name.keyword"],
  "from": 10
}

Would this allow you to get closer to your goal?
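To make the effect of the collapse query easier to reason about, here is a small in-memory model of it in plain Java. It is only a sketch of the documented behaviour (keep the top sorted document per collapse key, with "from" offsetting the collapsed results), not Elasticsearch client code. The names CollapseModel, Hit, and collapse are hypothetical, and timestamps are assumed to be strings in one fixed ISO-8601 format.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model of "sort, then collapse"; not part of any Elasticsearch API.
final class CollapseModel {

    /** Minimal stand-in for a search hit. */
    static final class Hit {
        final String text;      // plays the role of extended_tweet.full_text.keyword
        final String userName;  // plays the role of user.name.keyword
        final String timestamp; // assumed: fixed ISO-8601 format, so text order equals time order

        Hit(String text, String userName, String timestamp) {
            this.text = text;
            this.userName = userName;
            this.timestamp = timestamp;
        }
    }

    /** Sorts by timestamp then user name (both ascending), keeps the first hit per text, skips 'from' groups. */
    static List<Hit> collapse(List<Hit> hits, int from) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing((Hit h) -> h.timestamp)
                .thenComparing(h -> h.userName));

        Set<String> seenKeys = new HashSet<>();
        List<Hit> collapsed = new ArrayList<>();
        for (Hit h : sorted) {
            // Set.add returns false for a key already seen, so only the top sorted hit per key survives.
            if (seenKeys.add(h.text)) {
                collapsed.add(h);
            }
        }
        int start = Math.min(Math.max(from, 0), collapsed.size());
        return new ArrayList<>(collapsed.subList(start, collapsed.size()));
    }
}
```

In this model the same sort both picks each group's representative and orders the final list, so with "@timestamp" listed first the output is ordered by time and the user name only breaks ties.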

Matt! That's amazing! Thanks! I am able to translate this to the high level API with ease so I'll mark this as Done and Done.
