Hi All,
I am new to Elasticsearch, so any input or suggestions are much appreciated. I have built a binary random forest classification model with scikit-learn, and I want to use the predicted probability of class 1 (rfc.predict_proba(X_test) in sklearn) to sort documents in Elasticsearch (version 1.7). The features I use come from both the documents and some user inputs (the query). What I am currently doing is:
- Use a function_score query to select the documents that match the query;
- Run a script_score (the random forest classifier implemented in a Groovy script) and use its result as the score to sort the documents selected in the first step.
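The two steps above can be sketched roughly as the request body below, written as a Python dict. This is only an illustration under assumptions: the field name "category", the parameter names, and the script placeholder are hypothetical, and the actual Groovy model code is not shown.

```python
import json

# Placeholder for the Groovy source that evaluates the model (hypothetical).
GROOVY_SCRIPT = "0.0"


def build_query(user_inputs):
    """Build a function_score request body (Elasticsearch 1.x style).

    `user_inputs` is a hypothetical dict of query-side features that are
    handed to the script as params.
    """
    return {
        "query": {
            "function_score": {
                # Step 1: keep only the documents that match the query.
                "query": {
                    "filtered": {
                        "filter": {
                            "term": {"category": user_inputs["category"]}
                        }
                    }
                },
                # Step 2: score each remaining document with the script.
                "functions": [
                    {
                        "script_score": {
                            "lang": "groovy",
                            "script": GROOVY_SCRIPT,
                            "params": user_inputs,
                        }
                    }
                ],
                # Use the script result as the final score.
                "boost_mode": "replace",
            }
        }
    }


body = build_query({"category": "books", "user_age": 30})
print(json.dumps(body, indent=2))
```

With boost_mode set to "replace", the documents come back ordered by the script's output alone, which is what sorting by the class 1 probability requires.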
The problem I am running into is that even with only one tree implemented, my script already exceeds the JVM limit on the size of a single method (64 KB). The tree is 50 levels deep and has about 15,000 nodes. As I mentioned at the beginning, I am still a newbie, so this is probably not the best way to implement things. What would be a better approach, considering both feasibility and performance (speed)?
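One direction that might sidestep the method-size limit is to treat each tree as data rather than as generated if/else code, so the script is a short loop whatever the tree size. The sketch below shows that idea in Python, not Groovy, on a hypothetical three-node toy tree. The array names mirror the attributes scikit-learn exposes on a fitted tree (tree_.children_left, tree_.children_right, tree_.feature, tree_.threshold, tree_.value), where a left child of -1 marks a leaf. How such arrays would be passed to a script in Elasticsearch 1.7 is not covered here.

```python
# Hypothetical toy tree: the root splits on feature 0 at 0.5; nodes 1 and 2 are leaves.
children_left = [1, -1, -1]      # -1 marks a leaf, as in scikit-learn
children_right = [2, -1, -1]
feature = [0, -2, -2]            # -2 is the "undefined" marker on leaves
threshold = [0.5, -2.0, -2.0]
value = [[5.0, 5.0], [4.0, 1.0], [1.0, 4.0]]  # per-node weights for class 0, class 1

toy_tree = (children_left, children_right, feature, threshold, value)


def tree_proba_class1(x, children_left, children_right, feature, threshold, value):
    """Walk one tree iteratively and return the class 1 probability at the leaf."""
    node = 0
    while children_left[node] != -1:
        # scikit-learn sends a sample left when its feature value <= threshold.
        if x[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
    class0, class1 = value[node]
    return class1 / (class0 + class1)


def forest_proba_class1(x, trees):
    """Average the per-tree class 1 probabilities, as a random forest does."""
    return sum(tree_proba_class1(x, *tree) for tree in trees) / len(trees)


print(tree_proba_class1([0.3], *toy_tree))
print(forest_proba_class1([0.9], [toy_tree, toy_tree]))
```

The loop runs at most once per level of the tree, so a tree 50 levels deep costs at most 50 comparisons per document, and the 15,000 nodes live in arrays instead of in compiled method bytecode.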
The whole dataset is 25 GB in size and contains 10M documents. After filtering, about 30K documents remain to be scored.
Thanks a lot,
Wei