Heavy computational load in scripted metric aggregation

Hi all,
I have the following problem. I am indexing documents in Elasticsearch where each document is represented by a list of items (in my case integers); the list length can vary between documents. For example, doc1: [1,2,5], doc2: [2,3,5,6], etc. The respective field is called `items`. Then I execute a scripted metric aggregation as follows:

```python
metrics = es.search(index=index, body={
    "query": {
        "match_all": {}
    },
    "aggs": {
        "frequent_pairs": {
            "scripted_metric": {
                "init_script": "state.transactions = [];",
                "map_script": "state.transactions.add(params['_source']['items']);",
                "combine_script": naive_script,
                "reduce_script": reduce_cnt
            }
        }
    }
})
```

The mapping simply distributes the different item lists to the different shards. On each shard I then need to do some heavy computation, namely to enumerate all subsets of size 3 in each document and count their occurrences across the shard in a hash map. (I know Elasticsearch is not meant for this kind of computation, but I need it for an application.)
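Just to make the intended computation concrete, here is a plain-Python equivalent of what each shard is supposed to do (this is only an illustration, not what runs inside ES; `itertools.combinations` stands in for the triple-nested loop in the actual script below):

```python
from collections import Counter
from itertools import combinations

def count_triples(transactions):
    """Count every size-3 subset across all item lists on one shard."""
    itemsets = Counter()
    for items in transactions:
        for triple in combinations(items, 3):
            itemsets[triple] += 1
    return itemsets, len(transactions)

# e.g. count_triples([[1, 2, 5], [2, 3, 5, 6]])
```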

But this works only when either the document lists are short or the total number of documents is small; otherwise many of the shards fail to report results. I have already increased the heap space to 8 GB, assigned more shards, and increased the client timeout:

```python
es = Elasticsearch([{'host': 'localhost', 'port': 9200,
                     'timeout': 1000000, 'max_retries': 10,
                     'retry_on_timeout': True}])
```
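(By "assigned more shards" I mean recreating the index with a higher primary-shard count before indexing, along these lines; the count 8 is just an example:)

```python
es.indices.create(index=index, body={
    "settings": {"number_of_shards": 8, "number_of_replicas": 0}
})
```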

This improved the results to some degree, but many shards still don't return any results, and fine-tuning the parameters doesn't seem to help. Is there a solution, or should ES simply not be used for such expensive computations? I am using ES 7.5.2 and will be happy to provide more details on request. Any help will be appreciated!

`naive_script = """
HashMap itemsets = new HashMap(); 
    int nr_trans = 0;
    for (t in state.transactions){ 
        nr_trans += 1;
        int l = t.length;
        for (int i=0; i<l; i++){
            for (int j = i+1; j<l; j++){
                for (int k = j+1; k < l; k++){
                    String s = Integer.toString(t[i]) + "," + Integer.toString(t[j]) + "," + Integer.toString(t[k]) + ",";
                    int val = 0;
                    if (itemsets.containsKey(s)){
                        val = itemsets.get(s);
                    }
                    itemsets.put(s, val + 1);
                }
            }
        }
    }
    def res = [];
    res.add(itemsets);
    res.add(nr_trans);
    return res;//itemsets;
"""

`
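The `reduce_cnt` script is not shown above; all it does is merge the per-shard hash maps and sum the transaction counts. A minimal sketch of it, assuming each shard returns the `[itemsets, nr_trans]` pair produced by `naive_script`:

```python
reduce_cnt = """
    Map merged = new HashMap();
    int total_trans = 0;
    // states is the list of per-shard combine results: [itemsets, nr_trans]
    for (state in states) {
        if (state == null) continue;  // shards that failed report null
        Map itemsets = state[0];
        total_trans += state[1];
        for (entry in itemsets.entrySet()) {
            merged.put(entry.getKey(),
                       merged.getOrDefault(entry.getKey(), 0) + entry.getValue());
        }
    }
    def res = [];
    res.add(merged);
    res.add(total_trans);
    return res;
"""
```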

The underlying problem is that the Painless scripting language allows at most 1,000,000 statements to be executed in a loop. Making this limit configurable at the cluster level has been discussed, but apparently it has not been released yet: https://github.com/elastic/elasticsearch/issues/28946 Does anyone know more about this, or some way around it?
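One direction I have been experimenting with is to bound the per-run work by running the aggregation over document slices and merging the partial hash maps client-side. A rough sketch (the numeric `doc_id` field and the slice bounds are placeholders I made up for illustration; it reuses `naive_script` and `reduce_cnt` from above):

```python
from collections import Counter

def count_triples_sliced(es, index, max_id, slice_size):
    """Run the scripted metric over doc_id ranges so each run stays small."""
    merged, total = Counter(), 0
    for lo in range(0, max_id, slice_size):
        body = {
            "size": 0,
            "query": {"range": {"doc_id": {"gte": lo, "lt": lo + slice_size}}},
            "aggs": {
                "frequent_pairs": {
                    "scripted_metric": {
                        "init_script": "state.transactions = [];",
                        "map_script": "state.transactions.add(params['_source']['items']);",
                        "combine_script": naive_script,
                        "reduce_script": reduce_cnt,
                    }
                }
            },
        }
        res = es.search(index=index, body=body)
        itemsets, nr_trans = res["aggregations"]["frequent_pairs"]["value"]
        merged.update(itemsets)
        total += nr_trans
    return merged, total
```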
