Hey,
I'm having an issue with my query timing out after 10 minutes over a large dataset and was wondering if anyone could lend a hand.
My dataset is about 700 million rows covering 3 years of data (~75GB of primary shards). The data is split into one index per month, each with 1 primary shard and 1 replica, giving indices of ~2GB. When I query, I select only the indices for the relevant months.
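As a rough sketch of how the relevant months get selected: the `events-YYYY.MM` naming pattern and the `monthly_indices` helper below are assumptions for illustration, not the real index names.

```python
from datetime import date


def monthly_indices(start, end, prefix="events-"):
    """Return one index name per calendar month from start to end, inclusive.

    The prefix and the YYYY.MM suffix are hypothetical naming choices.
    """
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        names.append("%s%04d.%02d" % (prefix, year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# Comma-separated index list, as accepted in the index part of a search request.
index_param = ",".join(monthly_indices(date(2016, 1, 1), date(2016, 3, 31)))
```

A query for January 2016 alone would then touch a single ~2GB index, while a full-year query touches twelve.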
My query (posted below) involves two filters on integers and three aggregations: one on an unanalysed string, one on a date, and one on an integer (that was previously filtered on). I am running my queries using the elasticsearch-dsl Python library.
One thing that I am currently trying is to reindex my data into 6 month indices which will give indices of ~15GB. I'm just waiting for the reindex to finish.
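The grouping behind that reindex can be sketched like this. The monthly `events-YYYY.MM` names, the `events-YYYY-h1`/`-h2` target names, and both helper functions are hypothetical, and the actual document copy is not shown.

```python
from collections import OrderedDict


def half_year_index(monthly_name, prefix="events-"):
    """Map a monthly index name such as 'events-2016.03' to 'events-2016-h1'."""
    year, month = monthly_name[len(prefix):].split(".")
    half = 1 if int(month) <= 6 else 2
    return "%s%s-h%d" % (prefix, year, half)


def reindex_plan(monthly_names, prefix="events-"):
    """Group monthly source indices under their 6-month target index."""
    plan = OrderedDict()
    for name in monthly_names:
        plan.setdefault(half_year_index(name, prefix), []).append(name)
    return plan


plan = reindex_plan(["events-2015.11", "events-2015.12", "events-2016.01"])
for target, sources in plan.items():
    print("%s <- %s" % (target, ", ".join(sources)))
```

With this layout a one-month query still has to search a whole 6-month index, so the date range filter does more of the work than index selection.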
Elasticsearch version 2.1.1
This is an example query:
{
    'query': {
        'filtered': {
            'filter': {
                'bool': {
                    'must': [{
                        'term': {
                            'usergroup_id': u'8369'
                        }
                    }, {
                        'range': {
                            'timeOf': {
                                'gte': '2016-01-01'
                            }
                        }
                    }, {
                        'range': {
                            'timeOf': {
                                'lte': '2016-01-31'
                            }
                        }
                    }, {
                        'or': {
                            'filters': [{
                                'term': {
                                    'event_type': 4
                                }
                            }, {
                                'term': {
                                    'event_type': 5
                                }
                            }, {
                                'term': {
                                    'event_type': 6
                                }
                            }]
                        }
                    }]
                }
            },
            'query': {
                'match_all': {}
            }
        }
    },
    'aggs': {
        'per_individual_user': {
            'terms': {
                'field': 'sso_user_id',
                'size': 0
            },
            'aggs': {
                'per_date_interval': {
                    'date_histogram': {
                        'extended_bounds': {
                            'max': '2016-01-31',
                            'min': '2016-01-01'
                        },
                        'min_doc_count': 0,
                        'interval': u'month',  # We use various date groupings
                        'field': 'timeOf',
                        'order': {
                            '_key': 'asc'
                        }
                    },
                    'aggs': {
                        'per_event_type': {
                            'terms': {
                                'field': 'event_type',
                                'min_doc_count': 0
                            },
                            'aggs': {
                                'usage_sum': {
                                    'sum': {
                                        'field': 'duration'
                                    }
                                },
                                'usage_count': {
                                    'value_count': {
                                        'field': 'event_type'
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}