Heavy computational load in scripted metric aggregation

Hi all,
I have the following problem. I am indexing documents in Elasticsearch where each document is represented by a list of items (in my case integers); the list length can vary between documents. For example, doc1: [1,2,5], doc2: [2,3,5,6], etc. The respective field is called `items`. Then I execute a scripted metric aggregation as follows:

```python
metrics = es.search(index=index, body={
    "query": {
        "match_all": {}
    },
    "aggs": {
        "frequent_pairs": {
            "scripted_metric": {
                "init_script": "state.transactions = [];",
                "map_script": "state.transactions.add(params['_source']['items']);",
                "combine_script": naive_script,
                "reduce_script": reduce_cnt
            }
        }
    }
})
```

The mapping simply distributes the different item lists to the different shards. On each shard I then need to do some heavy computation, namely to enumerate all subsets of size 3 in each document and count their occurrences across the shard in a hash map. (I know Elasticsearch is not meant for this kind of computation, but I need it for an application.)
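Just to make the intended computation concrete, here is a plain-Python equivalent of what each shard is supposed to do (this is only an illustration, not what runs inside ES; `itertools.combinations` stands in for the triple-nested loop in the actual script below):

```python
from collections import Counter
from itertools import combinations

def count_triples(transactions):
    """Count every size-3 subset across all item lists on one shard."""
    itemsets = Counter()
    for items in transactions:
        for triple in combinations(items, 3):
            itemsets[triple] += 1
    return itemsets, len(transactions)

# e.g. count_triples([[1, 2, 5], [2, 3, 5, 6]])
```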

But this works only when either the document lists are short or the total number of documents is small; otherwise many of the shards fail to report results. I have already increased the heap space to 8 GB, assigned more shards, and increased the client timeout:

```python
es = Elasticsearch([{'host': 'localhost', 'port': 9200,
                     'timeout': 1000000, 'max_retries': 10,
                     'retry_on_timeout': True}])
```
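(By "assigned more shards" I mean recreating the index with a higher primary-shard count before indexing, along these lines; the count 8 is just an example:)

```python
es.indices.create(index=index, body={
    "settings": {"number_of_shards": 8, "number_of_replicas": 0}
})
```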

This improved the results to some degree, but many shards still don't return any results, and fine-tuning the parameters doesn't seem to help. Is there a solution, or should ES simply not be used for such expensive computations? I am using ES 7.5.2 and will be happy to provide more details on request. Any help will be appreciated!

`naive_script = """
HashMap itemsets = new HashMap(); 
    int nr_trans = 0;
    for (t in state.transactions){ 
        nr_trans += 1;
        int l = t.length;
        for (int i=0; i<l; i++){
            for (int j = i+1; j<l; j++){
                for (int k = j+1; k < l; k++){
                    String s = Integer.toString(t[i]) + "," + Integer.toString(t[j]) + "," + Integer.toString(t[k]) + ",";
                    int val = 0;
                    if (itemsets.containsKey(s)){
                        val = itemsets.get(s);
                    }
                    itemsets.put(s, val + 1);
                }
            }
        }
    }
    def res = [];
    res.add(itemsets);
    res.add(nr_trans);
    return res;//itemsets;
"""

`
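The `reduce_cnt` script is not shown above; all it does is merge the per-shard hash maps and sum the transaction counts. A minimal sketch of it, assuming each shard returns the `[itemsets, nr_trans]` pair produced by `naive_script`:

```python
reduce_cnt = """
    Map merged = new HashMap();
    int total_trans = 0;
    // states is the list of per-shard combine results: [itemsets, nr_trans]
    for (state in states) {
        if (state == null) continue;  // shards that failed report null
        Map itemsets = state[0];
        total_trans += state[1];
        for (entry in itemsets.entrySet()) {
            merged.put(entry.getKey(),
                       merged.getOrDefault(entry.getKey(), 0) + entry.getValue());
        }
    }
    def res = [];
    res.add(merged);
    res.add(total_trans);
    return res;
"""
```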

The underlying problem is that the Painless scripting language allows at most 1,000,000 statements to be executed in a loop. Making this limit configurable at the cluster level has been discussed, but apparently it has not been released yet: https://github.com/elastic/elasticsearch/issues/28946 Does anyone know more about this, or some way around it?
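One direction I have been experimenting with is to bound the per-run work by running the aggregation over document slices and merging the partial hash maps client-side. A rough sketch (the numeric `doc_id` field and the slice bounds are placeholders I made up for illustration; it reuses `naive_script` and `reduce_cnt` from above):

```python
from collections import Counter

def count_triples_sliced(es, index, max_id, slice_size):
    """Run the scripted metric over doc_id ranges so each run stays small."""
    merged, total = Counter(), 0
    for lo in range(0, max_id, slice_size):
        body = {
            "size": 0,
            "query": {"range": {"doc_id": {"gte": lo, "lt": lo + slice_size}}},
            "aggs": {
                "frequent_pairs": {
                    "scripted_metric": {
                        "init_script": "state.transactions = [];",
                        "map_script": "state.transactions.add(params['_source']['items']);",
                        "combine_script": naive_script,
                        "reduce_script": reduce_cnt,
                    }
                }
            },
        }
        res = es.search(index=index, body=body)
        itemsets, nr_trans = res["aggregations"]["frequent_pairs"]["value"]
        merged.update(itemsets)
        total += nr_trans
    return merged, total
```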
