Modeling Out Tags

(Jonmorehouse) #1

I'm building out a time series database in Elasticsearch where I'd like to support the idea of "aggregating tags". For instance, my initial data set has about 100k unique objects, each of which stores 1 document per hour (with a delta of statistics).

Each of the 100k objects can have a multitude of different tags. I could imagine upwards of 1000 different tags, and we'd expect to be able to return a time series response for any tag set over any time range.

I started out by storing tags as an array of strings and filtering on that. As a second pass, I stored each tag as a boolean field on each object (and saw a ~40% speed up). I don't believe this second approach would scale, because I'd end up with about 1000 unique fields. Could anyone point me in the right direction for designing something like this?
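To make the two approaches concrete, here is a sketch of the two document shapes being compared (field names and values are illustrative, not taken from the actual index):

```python
# Approach 1: tags stored as an array of strings on each hourly delta document;
# filtering is done with a term/terms filter against the `tags` field.
doc_with_tag_array = {
    'object_id': 'obj-123',
    'timestamp': '2015-01-01T00:00:00Z',
    'tags': ['facebook', 'mobile'],
    'views': 120,
    'shares': 4,
}

# Approach 2: each tag flattened into its own boolean field (~40% faster in
# testing, but implies ~1000 distinct fields once there are ~1000 tags).
doc_with_boolean_tags = {
    'object_id': 'obj-123',
    'timestamp': '2015-01-01T00:00:00Z',
    'facebook_tag': True,   # filtered with {'term': {'facebook_tag': True}}
    'mobile_tag': True,
    'views': 120,
    'shares': 4,
}
```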

Here's some simple Python code with my exact query:

def fetch_timeseries_buckets_by_tag_via_bucket_ids():
    """Alternative solution for creating bucketed, time-series data over a
    large number of documents.

    In the previous method, we do a few things:
        1.) we filter all the objects in the index set by tag
        2.) we filter all objects in the index set to make sure their date
            ranges are in the correct set

    Once that happens, we're given (for a set of 100k documents, over a 24 hour
    period) 24*100k or 2.4 million documents.

    As the query iterates over each document, it breaks it down into ranges,
    where each range corresponds to a timestamp. This involves comparing the
    timestamp of the document being iterated over to an upper/lower bound
    timestamp.

    Instead of explicitly using dates or timestamps for comparison, we can
    group elements directly by labeling "buckets" with an integer. For
    instance, at write time, this would mean that the worker would need to
    calculate a bucket identifier that this delta falls in.

    This could be a canonical, integer id that is easily mapped back to a
    timestamp at read time by clients of this system.
    """
    uri = 'time_series/object/_search?search_type=count'
    lower_bound = 0   # corresponds to some arbitrary day, say Jan 1
    upper_bound = 24  # corresponds to 24 time buckets later than the lower bound

    query = {
        'query': {
            'filtered': {
                'filter': {
                    # optimize this query to use a bitset so that caching will
                    # be shared between queries; this avoids using an `and` filter
                    'bool': {
                        'must': [
                            # instead of writing an array of tag strings, this
                            # writes each tag as a boolean attribute
                            {'term': {'facebook_tag': True}},
                            {'range': {'bucket_id': {'gt': lower_bound, 'lt': upper_bound}}},
                        ],
                    },
                },
            },
        },
        'aggregations': {
            'sum_by_bucket_id': {
                # this groups by the bucket id, avoiding the range/date
                # comparison altogether; it also leverages the cache better
                # and allows for better warming techniques
                'terms': {
                    # note ... we don't care about the timestamp ... we care
                    # about what "bucket" the data lies in :)
                    'keyed': True,
                    'field': 'bucket_id',
                    # by default, the result set only returns 10 buckets;
                    # ensure that all the buckets are returned
                    'size': upper_bound - lower_bound,
                },
                'aggregations': {
                    'views': {'sum': {'field': 'views'}},
                    'shares': {'sum': {'field': 'shares'}},
                },
            },
        },
    }
    return uri, query