Aggregation help needed

frankshad · November 14, 2018, 11:23am

(Simplifying slightly...) I've got an index with documents containing fields "batchid", "name" and "type". Batchid is based on date, so it sorts chronologically. Groups of documents are created at the same time with a common batchid and different names. Names may appear again in different batches, but only once within each batch. Batches may have as few as one document or many hundreds. There may be a few thousand batches, but they are purged over time, so the index doesn't grow indefinitely.

What I want to produce is a table initially of the M (20, say) most recent batches for some specific type, but which can be extended by M at a time to go back in time. And within each batch, the first N (3, say) documents in name order, again with the ability to retrieve more for just that batch in small amounts (maybe N, or maybe a bit more - say 3 the first time and then in blocks of 10 or 20).

Currently, I've got a really simple aggregation to obtain all the unique batchid values in descending order, and then I'm looping to do an ordinary search with from, size and sort to obtain the small number of documents from each batch. I'm hanging on to the batchid list so that next time I can just run the search loop, either for one batchid (to get more docs for that batch) or to add more batches at the end.
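The actual requests aren't shown here, but a rough sketch of this two-step approach might look like the following. It assumes a terms aggregation for the batchid list, keyword mappings for "batchid" and "name", and the `_key` ordering available from Elasticsearch 6.0. The helper names, the "invoice" type value and the batchid strings are all made up for illustration.

```python
def batch_list_body(doc_type, max_batches=10000):
    """Body for step 1: every distinct batchid for one type, newest first."""
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "query": {"term": {"type": doc_type}},
        "aggs": {
            "batches": {
                "terms": {
                    "field": "batchid",
                    "size": max_batches,
                    "order": {"_key": "desc"},
                }
            }
        },
    }


def batch_docs_body(doc_type, batchid, doc_from=0, doc_size=3):
    """Body for step 2: one window of documents from one batch, by name."""
    return {
        "from": doc_from,
        "size": doc_size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"type": doc_type}},
                    {"term": {"batchid": batchid}},
                ]
            }
        },
        "sort": [{"name": {"order": "asc"}}],
    }


# Hypothetical batchid values standing in for the buckets from step 1.
batchids = ["20181114-02", "20181114-01"]
per_batch_requests = [batch_docs_body("invoice", b) for b in batchids]
```

This is one request for the list plus one request per visible batch, which is the inefficiency in question.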

As M and N are fairly small, this seems not to be unduly slow, but it is clearly not very efficient. I started out trying to express this as a single aggregation, where it seemed I should be using top_hits, but I spent hours trying to work out the combinations that would work, looking at numerous examples in the ES docs and stackexchange etc, but without success. I couldn't get the batches in the right order, and I couldn't see how to get the later windows ('from' never seemed to be able to go in the places where it looked like I would need it).

Is this in fact possible with a single request for each sliding window? Could anyone give me some pointers, in general, but particularly how to get the later batches and later documents within a batch?

Many thanks.

abdon · November 14, 2018, 12:46pm

Rather than using aggregations, have you considered using search in combination with field collapsing? I'm thinking that you could search for a specific "type", sort on "batchid", and then also collapse on "batchid". You can then use inner_hits to paginate within each batch to get multiple documents per batch, sorted however you want.
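A minimal sketch of such a request body, built as a plain Python dict so it can be handed to whichever client is in use. It assumes "batchid" and "name" are keyword (or numeric) fields with doc values, which collapsing and sorting need. The helper name, its parameters, the inner_hits name and the "invoice" type value are illustrative only.

```python
import json


def collapsed_page(doc_type, batch_from=0, batch_size=20, docs_per_batch=3):
    """Build a search body for one window of batches, newest first."""
    return {
        # Top-level from/size page through the collapsed batches.
        "from": batch_from,
        "size": batch_size,
        "query": {"term": {"type": doc_type}},
        "sort": [{"batchid": {"order": "desc"}}],
        "collapse": {
            "field": "batchid",
            # inner_hits returns the first few documents of each batch.
            "inner_hits": {
                "name": "docs_in_batch",
                "size": docs_per_batch,
                "sort": [{"name": {"order": "asc"}}],
            },
        },
    }


first_window = collapsed_page("invoice")                 # batches 1 to 20
next_window = collapsed_page("invoice", batch_from=20)   # batches 21 to 40
print(json.dumps(first_window, indent=2))
```

Each window is then a single request. Fetching more documents for just one batch could remain an ordinary search filtered on that batchid, with from, size and a sort on name.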

frankshad · November 14, 2018, 1:16pm

Ah, thank you very much indeed - that does look like a good fit for what I'm trying to do. I'll experiment. I hadn't noticed field collapsing before.

frankshad · November 16, 2018, 5:23pm

This worked very nicely, thank you; it modelled almost exactly what I was trying to do.

There was just one bit of information it didn't seem to provide: the total number of distinct batchid values in the collapsed search. The top-level from seems to work as I would expect, that is, relative to the collapsed list of records rather than the uncollapsed one. But total in the top-level hits is the number of records matching the uncollapsed query, not the collapsed one, and there doesn't seem to be a value that reflects the collapsed count. I bet you have it to hand internally, so passing it on would be great.

In my case here it didn't really matter - it just meant I couldn't say "showing M of T", just "showing M". But T was large enough that no one is going to scroll that far; in my case it's like Google telling you there's a million hits. But in general, being able to say how many batches there are would be useful.
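One possible workaround, not something confirmed in this thread: a cardinality aggregation on "batchid" sent along with the collapsed search gives an approximate count of distinct batches, because aggregations run over the uncollapsed matches. A sketch, assuming "batchid" is a keyword field; the aggregation name batch_count, both helpers and the sample response are invented for illustration.

```python
def with_batch_count(search_body, precision_threshold=3000):
    """Return a copy of a search body with a distinct-batchid count added.

    The cardinality aggregation is approximate; counts below the
    precision threshold are expected to be close to exact.
    """
    body = dict(search_body)
    aggs = dict(body.get("aggs", {}))
    aggs["batch_count"] = {
        "cardinality": {
            "field": "batchid",
            "precision_threshold": precision_threshold,
        }
    }
    body["aggs"] = aggs
    return body


def showing_label(response, shown):
    """Format "showing M of T" from a search response dict."""
    total_batches = response["aggregations"]["batch_count"]["value"]
    return "showing {} of {}".format(shown, total_batches)


body = with_batch_count({
    "query": {"term": {"type": "invoice"}},
    "collapse": {"field": "batchid"},
})

# Hand-written stand-in for the relevant part of a response, not real output.
sample_response = {"aggregations": {"batch_count": {"value": 1234}}}
print(showing_label(sample_response, 20))  # prints: showing 20 of 1234
```

With only a few thousand batches, the approximation should be good enough for a "showing M of T" label.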

Thanks again.
