On Wed, Nov 5, 2014 at 10:52 AM, Ankur Goel <ankr...@gmail.com> wrote:
Hi,
We are trying to run an aggregation over around 5 million documents, with
field cardinality on the order of 1000. The aggregation is a filter
aggregation that wraps an underlying terms aggregation. Right now it takes
around 1.2 seconds on average to compute, and the time increases when the
number of documents grows or when I run multiple aggregations. We have AWS
extra large machines, with 3 shards and replication 2.
1.) Can we improve this time (we would like to get it under 1 second)? I can
see very little, if any, of the field cache being used.
2.) How does this scale? The time increases with the number of documents; how
can I offset that (adding nodes, replication, sharding)?
3.) Are there any better options (plugins, or a different platform for
aggregating data)?
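For readers unfamiliar with the request shape described above (a filter aggregation wrapping a terms aggregation), here is a minimal sketch as a Python dict. It is only an illustration under assumptions: the helper `build_filter_terms_agg`, the aggregation names `filtered_docs` and `top_terms`, and the fields `status` and `category` are hypothetical and are not taken from the actual request in this thread.

```python
import json


def build_filter_terms_agg(filter_clause, terms_field, size=10):
    """Build a search body whose filter aggregation wraps a terms aggregation.

    All names used here are hypothetical placeholders.
    """
    return {
        "query": {"match_all": {}},
        "aggs": {
            "filtered_docs": {
                # Only documents matching this filter reach the sub-aggregation.
                "filter": filter_clause,
                "aggs": {
                    "top_terms": {
                        "terms": {"field": terms_field, "size": size}
                    }
                },
            }
        },
    }


body = build_filter_terms_agg({"term": {"status": "active"}}, "category")
print(json.dumps(body, indent=2))
```

Note that the top-level query still matches every document; the filter only narrows what the nested terms aggregation counts.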
On Wednesday, 5 November 2014 19:38:42 UTC+5:30, Adrien Grand wrote:
Can you please show the JSON of the request that you send to Elasticsearch?
On Mon, Nov 10, 2014 at 11:53 AM, Ankur Goel <ankr...@gmail.com> wrote:
}
This is a sample; the match_all is usually replaced by some query.
On Wednesday, 12 November 2014 04:16:54 UTC+5:30, Adrien Grand wrote:
Hi Ankur,
I assume that your revenueFilter aggregation uses an actual filter and not
a match_all filter? Otherwise you could just remove it.
Are you actually interested in the top hits that match your query? If not,
you could switch to the count search type and move the filter from your
aggregation into a filtered query; this would be faster.
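The suggestion above (use the count search type and move the filter out of the aggregation into a filtered query) could look roughly like the following sketch. It assumes the Elasticsearch 1.x request syntax that was current at the time of this thread; the helper `build_count_request`, the aggregation name `top_terms`, and the fields `status` and `category` are hypothetical.

```python
import json


def build_count_request(query, filter_clause, terms_field, size=10):
    """Return (url_params, body) for an aggregation-only search.

    The filter is applied once in a filtered query, so the terms
    aggregation sits at the top level without a wrapping filter aggregation.
    """
    # search_type=count asks only for counts and aggregations, not top hits.
    params = {"search_type": "count"}
    body = {
        "query": {"filtered": {"query": query, "filter": filter_clause}},
        "aggs": {
            "top_terms": {"terms": {"field": terms_field, "size": size}}
        },
    }
    return params, body


params, body = build_count_request(
    {"match_all": {}}, {"term": {"status": "active"}}, "category"
)
print(params)
print(json.dumps(body, indent=2))
```

The params would be passed as the query string of the `_search` request and the body as its JSON payload. This form only applies when every aggregation in the request shares the same filter.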
We are already using the count search type, and the filter will be an actual
filter. We want a different filter on each aggregation, so it would not be
possible to use a filtered query.
Can we improve the time by using more replicas or more shards?
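To make the constraint above concrete, here is a sketch of one request carrying several filter aggregations, each with its own filter, which is why a single filtered query cannot replace them. It assumes the Elasticsearch 1.x request syntax; `revenueFilter` is the aggregation name mentioned in this thread, while `ordersFilter`, the helper `build_multi_filter_aggs`, and all field names and values are hypothetical.

```python
import json


def build_multi_filter_aggs(specs, size=10):
    """Build a body with one filter aggregation per entry in specs.

    specs maps an aggregation name to a (filter_clause, terms_field) pair.
    """
    aggs = {}
    for name, (filter_clause, terms_field) in specs.items():
        aggs[name] = {
            # Each aggregation keeps its own, independent filter.
            "filter": filter_clause,
            "aggs": {
                "top_terms": {"terms": {"field": terms_field, "size": size}}
            },
        }
    return {"query": {"match_all": {}}, "aggs": aggs}


specs = {
    "revenueFilter": ({"term": {"type": "revenue"}}, "category"),
    "ordersFilter": ({"term": {"type": "order"}}, "region"),
}
body = build_multi_filter_aggs(specs)
print(json.dumps(body, indent=2))
```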