I'm looking for a scalable way to score documents by tf-idf on a large dataset, so Elasticsearch naturally came to mind. Crucially, I need raw tf-idf rather than the Practical Scoring Function that Lucene uses under the hood. Is there any way to get Elasticsearch to return a raw tf-idf score without the additional normalization and boost factors? I've tested each of the built-in implementations, and none works as well for my data as plain tf-idf.
So far I've investigated custom scoring functions, but they don't seem to be the right tool for the task.
Also, I'm using a hosted Elasticsearch instance, so I don't have access to the internals of the Java code, only the REST API.
All the examples I found seemed to be tied to individual fields, and there didn't appear to be any way to extract the overall document frequency of each term. I would love to find a way to do this, though.
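To make "raw tf-idf" concrete, here is a minimal sketch of the score in question. It assumes the common `tf * log(N / df)` weighting summed over query terms; the thread does not say which variant is meant, and the function and parameter names are hypothetical. It is plain Python and does not call Elasticsearch at all.

```python
import math
from collections import Counter


def raw_tf_idf(query_terms, doc_tokens, doc_freq, num_docs):
    """Score one document as the sum of tf * log(N / df) over query terms.

    Assumption: unnormalized term frequency and an unsmoothed log(N / df)
    idf. No length norm, query norm, coord factor, or boost is applied.
    """
    tf = Counter(doc_tokens)  # raw term counts in this document
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)  # number of documents containing the term
        if df == 0 or tf[term] == 0:
            continue  # term absent from the corpus or from this document
        score += tf[term] * math.log(num_docs / df)
    return score
```

The inputs this needs are exactly the ones that are hard to get over REST: the per-document term frequency, the corpus-wide document frequency of each term, and the total document count.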
See the new advanced scripting docs, which describe how to create a script engine. A script engine has access to Lucene internals, so you could get raw tf and idf.
Hi Ryan - is there a way to do similar scripting via REST only? I have to use a remote Elasticsearch client, so I can't add a plugin.