Aggregation to take the first result for every unique value of a term

Hey.

I have a bunch of documents that set an attribute - "unique_hash". There will be duplicates based on that hash, and I'm looking to return only the first result for each value of "unique_hash" - i.e. one document per unique value.

I've started out down the road of a "terms" aggregation with a "top_hits" aggregation, something like this:

  "agg": {                                                              
    "my_docs": {                                                 
      "terms": {                                                         
        "field": "unique_hash.keyword"                                              
      },                                                               
      "aggs": {                                                          
        "my_docs": {                                             
          "top_hits": {                                                  
            "sort": [                                                    
              {                                                        
                "created_at": "desc"                                     
              }                                                        
            ],                                                         
            "size": 1                                                    
          }                                                            
        }                                                              
      }                                                                
    }                                                                  
  }                                                                    
} 

Which I believe to be correct, and it seems to behave on tiny data sets in development. In my production environment, however, in order to return anything approaching the full set of buckets for my unique values, I need to set a high "size" value on the outer terms aggregation. When I do this, my HTTP client times out before the query concludes, and I've seen Elasticsearch exit with a "127" exit code.
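
That is, something like this on the terms agg (10000 here is just an example value):

  "terms": {
    "field": "unique_hash.keyword",
    "size": 10000
  }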

The "unique_hash" value is definitely high cardinality, so I guess that is my problem?

Is there any way to get around this? I tried using partitions:
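
A sketch of the kind of request I mean - I don't have the exact body to hand, and the partition numbers here are illustrative (the rest of the query, including the top_hits sub-aggregation, was unchanged):

  "terms": {
    "field": "unique_hash.keyword",
    "include": {
      "partition": 0,
      "num_partitions": 20
    },
    "size": 10000
  }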

but this didn't appear to change the outcome.

Which client are you using?

Sounds worrying. What is logged?

What number of partitions did you use? Did you first run a cardinality agg to discover how many unique hashes you have and use that as a basis for deciding how many partitions might be useful?
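
For reference, the cardinality check is just something like this (the aggregation name is arbitrary, the field name is taken from your query above, and the count it returns is approximate):

{
  "size": 0,
  "aggs": {
    "unique_hash_count": {
      "cardinality": {
        "field": "unique_hash.keyword"
      }
    }
  }
}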

You could also look at using the composite aggregation and the after parameter to page through values.
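
Roughly along these lines - you page by taking the key of the last bucket in each response and passing it back in as "after" (the source name "hash" and the page size here are illustrative):

{
  "size": 0,
  "aggs": {
    "my_docs": {
      "composite": {
        "size": 1000,
        "sources": [
          { "hash": { "terms": { "field": "unique_hash.keyword" } } }
        ]
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "sort": [
              { "created_at": "desc" }
            ],
            "size": 1
          }
        }
      }
    }
  }
}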

Hey Mark,

Thanks for the reply.

I'm using Faraday (NetHTTP) in Ruby. Timeout is 60s.

The logs show it ran out of memory:

[2018-01-19T18:05:33,250][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[NkT2R14][search][T#7]], exiting
java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.packed.PackedReaderIterator.<init>(PackedReaderIterator.java:45) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.packed.PackedInts.getReaderIteratorNoHeader(PackedInts.java:845) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$BlockState.doReset(CompressingStoredFieldsReader.java:452) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]

I tried setting size to 10000, and using 1000 partitions.

Composite looks very interesting indeed! Scroll for buckets is precisely my use case. Which version was it introduced in?

Hey Mark,

So I ran the cardinality aggregation, and my index has about 150k unique values for this field across 800k documents, so it's relatively high cardinality I guess.

Unfortunately, upgrading to 6 isn't quite practical just yet, so I've paused looking into composite.

Today I had a look at field collapsing, which seems like a fairly practical way to achieve what I need - effectively a single document for each unique value of my unique_hash field - but it is still quite slow: about 400ms for a match_all, and maybe 3s for a filter query on a couple of fields. And that's just to return 10 results; it gets considerably worse if I request more.

This seems crazy slow, so I'm wondering if there's something else afoot here - any pointers perhaps?

For your reference, the field_collapse query I ran was:

{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "unique_hash.keyword",
    "inner_hits": {
      "name": "latest",
      "size": 1,
      "sort": [
        { "created_at": "desc" }
      ]
    }
  }
}
