Aggregation to take the first result for every unique value of a term


(Thomas Millward Wright) #1

Hey.

I have a bunch of documents that set an attribute, "unique_hash". There will be duplicates based on that hash, and I'm looking to return only the first result for each value of "unique_hash" - one document per unique value.

I've started out down the road of a "terms" aggregation with a "top_hits" aggregation, something like this:

  "agg": {                                                              
    "my_docs": {                                                 
      "terms": {                                                         
        "field": "unique_hash.keyword"                                              
      },                                                               
      "aggs": {                                                          
        "my_docs": {                                             
          "top_hits": {                                                  
            "sort": [                                                    
              {                                                        
                "created_at": "desc"                                     
              }                                                        
            ],                                                         
            "size": 1                                                    
          }                                                            
        }                                                              
      }                                                                
    }                                                                  
  }                                                                    
} 

Which I believe to be correct, and it seems to behave on tiny data sets in development. On my production environment, however, in order to return anything approaching the full set of buckets for my unique values, I need to set a high "size" value in the outer terms aggregation. When I do this, my HTTP client times out before the query concludes, and I've seen Elasticsearch exit with a "127" exit code.

The "unique_hash" value is definitely high cardinality, so I guess that is my problem?

Is there any way to get around this? I tried using partitions, but this didn't appear to change the outcome.
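For reference, terms partitioning splits the set of unique terms into a fixed number of chunks and returns only one chunk per request, so the client loops "partition" from 0 to num_partitions - 1. A sketch of one such request, reusing the aggregation above (the partition count and size here are illustrative, not the values from the thread):

```json
{
  "size": 0,
  "aggs": {
    "my_docs": {
      "terms": {
        "field": "unique_hash.keyword",
        "include": { "partition": 0, "num_partitions": 20 },
        "size": 10000
      },
      "aggs": {
        "my_docs": {
          "top_hits": {
            "sort": [{ "created_at": "desc" }],
            "size": 1
          }
        }
      }
    }
  }
}
```

Each request only has to hold one partition's buckets in memory, which is the point of the feature.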


(Mark Harwood) #2

Which client are you using?

Sounds worrying. What is logged?

What number of partitions did you use? Did you first run a cardinality agg to discover how many unique hashes you have and use that as a basis for deciding how many partitions might be useful?
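For what it's worth, the cardinality check Mark suggests is a one-liner (the count is approximate; field name taken from the earlier posts):

```json
{
  "size": 0,
  "aggs": {
    "hash_count": {
      "cardinality": { "field": "unique_hash.keyword" }
    }
  }
}
```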

You could also look at using the composite aggregation and the after parameter to page through values.
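A composite version of the same query might look like this (a sketch; "hash" and the page size are illustrative names/values). The response includes an "after_key", which you copy into an "after" parameter on the next request to fetch the next page of buckets:

```json
{
  "size": 0,
  "aggs": {
    "my_docs": {
      "composite": {
        "size": 1000,
        "sources": [
          { "hash": { "terms": { "field": "unique_hash.keyword" } } }
        ]
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "sort": [{ "created_at": "desc" }],
            "size": 1
          }
        }
      }
    }
  }
}
```

Subsequent pages add, for example, "after": { "hash": "<last after_key value>" } inside the "composite" block.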


(Thomas Millward Wright) #3

Hey Mark,

Thanks for the reply.

I'm using Faraday (NetHTTP) in Ruby. Timeout is 60s.

The logs show it ran out of memory:

[2018-01-19T18:05:33,250][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[NkT2R14][search][T#7]], exiting
java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.packed.PackedReaderIterator.<init>(PackedReaderIterator.java:45) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.util.packed.PackedInts.getReaderIteratorNoHeader(PackedInts.java:845) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader$BlockState.doReset(CompressingStoredFieldsReader.java:452) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]

I tried setting size to 10000, and using 1000 partitions.

Composite looks very interesting indeed! Scroll for buckets is precisely my use case. Which version was it introduced in?


(Thomas Millward Wright) #4

Hey Mark,

So I ran the cardinality aggregation, and my index has about 150k unique values for this field across 800k documents, so it's relatively high cardinality I guess.

Unfortunately, upgrading to 6 isn't quite practical just yet, so I've paused looking into composite.

Today I had a look at field collapsing, which seems like a fairly practical way to achieve what I need (effectively a single document for each unique value of my unique_hash field), but it is still quite slow: about 400ms for a match_all, and maybe 3s for a filter query on a couple of fields. That's to return 10 results, and it gets much worse if I request more.

This seems crazy slow, so I'm wondering if there's something else afoot here - any pointers perhaps?

For your reference, the field_collapse query I ran was:

{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "unique_hash.keyword",
    "inner_hits": {
      "name": "latest",
      "size": 1,
      "sort": [
        { "created_at": "desc" }
      ]
    }
  }
}

(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.