How to limit results with the same field value to X documents each time that value shows up multiple times in a row

I have an issue that I am not quite sure how to solve. Really hope someone here can help me figure out how to go about it.

Imagine that I have 100 documents, all with user_id fields. I know that most documents are from different user_ids, but documents 1-10 and 20-29 are from the same user_id.

What I want to do is make sure that I only see the latest two documents whenever the same user_id is returned more than twice in a row. So if user_id 1 shows up more than twice in a row, I want to limit those documents to two. I want this to happen every time it happens for that user_id, not limit that user_id completely after the first group.


If I just request all documents as they are indexed now, I would get a result like:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]


What I am looking for is a way to make sure these groups of 1s are limited to two documents in a row, like so:

[1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 1, 12, ...]

Notice that 1, 1, ..., 1, 1, ... happens here, meaning that the runs of identical user ids have been cut down to two, instead of removing the later runs altogether, which would result in something like:

[1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...]


I also want this to work if the request is paginated (multiple queries).

So imagine that I request the first two pages, with a size of 5, then I would like to get:

Page1: [1, 1, 2, 3, 4]

Page2: [5, 6, 7, 8, 9]

Instead of:

Page1: [1, 1, 2, 3, 4]

Page2: [1, 1, 1, 1, 1]


I hope that I have described the issue well enough for someone to understand. If not, then please let me know so I can try explaining it another way.

Hi Søren,
Field collapsing sounds like the closest thing to what you need.
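Something along these lines might be a starting point (an untested sketch; I'm assuming user_id is a keyword or numeric field with doc_values, and that you have a time field to sort on):

GET myindex/_search
{
  // "time" is an assumed recency field; user_id must be a keyword or numeric field
  "collapse": {
    "field": "user_id",
    "inner_hits": {
      "name": "latest_two",
      "size": 2,
      "sort": [{"time": {"order": "desc"}}]
    }
  },
  "sort": [{"time": {"order": "desc"}}]
}

Each hit is then the top document per user_id, with that user's latest two documents under inner_hits.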


Thank you for the answer Mark.

I have looked into this, and it does seem like this is what I need.

However, I am confused as to how I can paginate on this. If I set a size, that's just the top results, but the inner_hits can have several results. So if I set an outer size of 10, there could really be up to 20 hits, as each top hit will have 1-2 inner hits.

This would result in a weird response, where the client has requested 10 but gets up to 20.

Is there a common way around this that I am missing?

Just trying to understand the use case. Are you using aggregations, or just a plain query to match and retrieve the documents?

Can your problem be described as "latest 2 documents per user_id"?
If so, a user_id won't appear again in the response if you use a terms aggregation, unless you use another sub-aggregation/field to group by.


Hi @wiouser - You understand my problem. I want the latest two documents per user_id, and then never see that user_id again on any other page. I do not use any aggregation, just a plain query.

Then you have two options.
If you want a limited number of user_ids, you can use a terms aggregation with top_hits as a sub-aggregation.
If you want to paginate the results, use a composite aggregation with top_hits as a sub-aggregation. I have given the terms and top_hits aggregation example below.

GET myindex/_search
{
  "size": 0,
  "aggs": {
    "user_list": {
      "terms": {
        "field": "user_id",
        "size": 10
      },
      "aggs": {
        "topdocuments": {
          "top_hits": {
            "size": 2,
            "sort": [{"time": {"order": "desc"}}]
          }
        }
      }
    }
  }
}

I have sorted on the time field to return the latest two documents.
Take a look at top_hits aggregation for more options.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html


@wiouser Thank you for the answer. I tried it and it does indeed work, but as with the solution that @Mark_Harwood suggested, I also end up with 10 groups with 1-2 items per user.

So what confuses me is that if I want to paginate this result, the client will request ?page=1&limit=10, but in the aggregated result set I have 10-20 items, so I would have to discard some of them to fulfill the promise of limit=10.

I might be missing some basic ES knowledge here

Discarding results might not be the end of the world. We often discard matches behind the scenes when pulling the top N of something from a distributed set of data stores.

For my case I cannot discard them. We're building a feed with each user's latest two posts. A user can upload multiple posts in a row, but we do not want the feed flooded by the same user if they upload 10+ posts in one go, so we want to make sure it's only the last two.

So for every page, I would have to decide on a few things that would not get shown. It would be arbitrary if I just started picking things out on each request to make sure there are only 10 per page.

Not sure what my options are here, or if ES can help me out of the box.

Try a terms value source in a composite aggregation instead of the terms aggregation. You have to send the after_key with every subsequent request.
It also supports size and order, and you can have top_hits as a sub-aggregation for the composite, as in the previous example.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
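Roughly like this, reusing the index and time field from the earlier example (an untested sketch; the "user" source name is just a placeholder):

GET myindex/_search
{
  "size": 0,
  "aggs": {
    "user_list": {
      "composite": {
        // page size = number of user_id buckets returned per request
        "size": 10,
        "sources": [
          {"user": {"terms": {"field": "user_id"}}}
        ]
      },
      "aggs": {
        "topdocuments": {
          "top_hits": {
            "size": 2,
            "sort": [{"time": {"order": "desc"}}]
          }
        }
      }
    }
  }
}

For the next page, copy the after_key object from the response into the request as "after": {"user": "..."} inside the composite block, and the aggregation continues from where the previous page ended.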
