Ways to speed up a date range query with an aggregation of a parent child field

shocking_black · May 10, 2020, 6:00pm

Am trying to implement a "trending" feature for our video site based on date-tagged user favorites of videos.

Our schema consists of Video documents with Favorites implemented as a child document of Videos, and each Favorite document contains the date at which it was favorited.

The query should get the most Favorited videos within the last day/week/month.

1 node, 1 shard, ~200k Video documents with ~250m Favorites documents.

The following query runs in about 1.5 seconds on average:

   post('/videos/_search', json={
    'profile': True
    'query': {
        'range': {
            'favorite_date': {
                'gte': 'now-800h/d'
            }
        }
    },
    'aggs': {
        'num_favorites_for_range': {
            'terms': {
                'field': 'relation_type#video',
                'size': 50
            }
        }
    }
}

The profile is as follows:

     'profile': {'shards': [{'id': '[M2vh8UPMQxqo4AUbvAC5ng][videos][0]',
    'searches': [{'query': [{'type': 'IndexOrDocValuesQuery',
        'description': 'favorite_date:[1586217600000 TO 9223372036854775807]',
        'time_in_nanos': 376529245,
        'breakdown': {'set_min_competitive_score_count': 0,
         'match_count': 0,
         'shallow_advance_count': 0,
         'set_min_competitive_score': 0,
         'next_doc': 183424428,
         'match': 0,
         'next_doc_count': 7264182,
         'score_count': 7264182,
         'compute_max_score_count': 0,
         'compute_max_score': 0,
         'advance': 686340,
         'advance_count': 32,
         'score': 137075517,
         'build_scorer_count': 64,
         'create_weight': 157,
         'shallow_advance': 0,
         'create_weight_count': 1,
         'build_scorer': 40814342}}],
      'rewrite_time': 1156,
      'collector': [{'name': 'MultiCollector',
        'reason': 'search_multi',
        'time_in_nanos': 1155358974,
        'children': [{'name': 'SimpleTopScoreDocCollector',
          'reason': 'search_top_hits',
          'time_in_nanos': 229478297},
         {'name': 'ProfilingAggregator: [num_favorites_for_range]',
          'reason': 'aggregation',
          'time_in_nanos': 402608004}]}]}],
    'aggregations': [{'type': 'GlobalOrdinalsStringTermsAggregator',
      'description': 'num_favorites_for_range',
      'time_in_nanos': 392044951,
      'breakdown': {'reduce': 0,
       'build_aggregation': 22914528,
       'build_aggregation_count': 1,
       'initialize': 4256,
       'initialize_count': 1,
       'reduce_count': 0,
       'collect': 361861983,
       'collect_count': 7264182}}]}]}}

As you can see, we've avoided doing a direct has_parent query as these are known to be slow.

Surprisingly, the aggregation is not the slowest part of the query.

Any suggestions for how this query could be sped up?(sub 100ms would be optimal).

system · June 7, 2020, 6:00pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.