Implementing random forest in ElasticSearch

paulshadepanda · April 8, 2016, 8:08pm

Hi All,

I am new to Elasticsearch, so any input or suggestions are much appreciated. I have built a binary random forest classification model with scikit-learn, and I want to sort documents in Elasticsearch (version 1.7) by the predicted probability of class 1 (rfc.predict_proba(X_test) in sklearn). The features come from both the documents and user input (the query). What I am currently doing is:

1. Use function_score to select the documents that match the query;
2. Run a script_score (the random forest classifier implemented in a Groovy script) to score and sort the documents selected in step 1.
The problem I am running into is that even with only one tree implemented, my script already exceeds Java's 64 KB limit on the size of a single method. The tree has 50 levels and about 15,000 nodes. As I mentioned at the beginning, I am still a newbie, so this is probably not the best way to implement things. What would be a better approach in terms of feasibility and performance (speed)?

The whole dataset is 25 GB in size and has 10M documents. After filtering, about 30K documents remain.

Thanks a lot,
Wei

jprante · April 8, 2016, 10:13pm

Forget about script_score; it is not meant for classifiers. You have to write an Elasticsearch plugin that can execute a random forest classifier.
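
Whichever route is taken, the 64 KB limit most likely comes from compiling the tree as nested if/else code. Storing the tree as data and walking it in a loop keeps the method small no matter how many nodes there are. Below is a minimal sketch in Python, assuming the tree is exported as flat arrays laid out like scikit-learn's tree_ attributes (children_left, children_right, feature, threshold, value). The two tiny trees and the function names predict_proba_tree and predict_proba_forest are made up for illustration, and nothing here is Elasticsearch API; the same loop would have to be ported to Java inside a plugin.

```python
# Hypothetical hand-built trees, using the same array layout as
# scikit-learn's tree_ attributes (assumption: trees are exported this way).
# TREE_A: node 0 tests x[0] <= 0.5, node 2 tests x[1] <= 2.0.
TREE_A = {
    "children_left":  [1, -1, 3, -1, -1],   # -1 marks a leaf
    "children_right": [2, -1, 4, -1, -1],
    "feature":        [0, -2, 1, -2, -2],   # -2 is unused at leaves
    "threshold":      [0.5, 0.0, 2.0, 0.0, 0.0],
    "value":          [[5, 5], [4, 0], [1, 5], [1, 1], [0, 4]],  # per-class weights
}
# TREE_B: a single split on x[1] <= 1.0.
TREE_B = {
    "children_left":  [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature":        [1, -2, -2],
    "threshold":      [1.0, 0.0, 0.0],
    "value":          [[4, 4], [3, 1], [1, 3]],
}


def predict_proba_tree(tree, x):
    """Walk one tree iteratively and return class probabilities at the leaf."""
    node = 0
    while tree["children_left"][node] != -1:
        # Same rule as scikit-learn: go left when feature value <= threshold.
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
    weights = tree["value"][node]
    total = float(sum(weights))
    return [w / total for w in weights]


def predict_proba_forest(trees, x):
    """Average the per-tree class probabilities over all trees."""
    n_classes = len(trees[0]["value"][0])
    sums = [0.0] * n_classes
    for tree in trees:
        for k, p in enumerate(predict_proba_tree(tree, x)):
            sums[k] += p
    return [s / len(trees) for s in sums]


# The class-1 probability would serve as the document score.
score = predict_proba_forest([TREE_A, TREE_B], [1.0, 3.0])[1]
```

The cost per document is one loop iteration per tree level, so a 50-level tree needs at most 50 comparisons, and the code size stays constant as the forest grows.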