I'm trying to gauge whether elasticsearch is a good fit for future projects by taking a deep dive into the Painless API. We're an NLP shop with tight deadlines. The prospect of being able to write custom similarity functions without having to create a full-blown Java plugin is pretty interesting.
So, one of the things I'm looking into is whether it's possible to implement a custom language model-based Similarity. For example: a similarity that defines relevance in terms of the Kullback-Leibler divergence between a document generation model and a query generation model. In order to do this, though, I need access to the terms that occur in the query, something which the Painless API currently does not provide in the Similarity context, or in any context, for that matter. In fact, I can't really see how 'Scripted Similarities' (as demonstrated here) allow for much more than writing variations on the TF-IDF weighting scheme.
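To make concrete what I'm after, here is a rough Python sketch (nothing Elasticsearch-specific; all names below are mine, not part of any API) of Dirichlet-smoothed KL-divergence ranking. The key point is that scoring a document requires the full query term distribution, not just the single term being scored:

```python
# Hypothetical sketch of KL-divergence ranking (Lafferty & Zhai style):
# rank by -KL(theta_q || theta_d), dropping the query-entropy term,
# which is constant across documents. Not Painless, not the ES API.
import math
from collections import Counter

def kl_divergence_score(query_terms, doc_terms, collection_freqs,
                        collection_size, mu=2000.0):
    """Score one document against a query under a Dirichlet-smoothed
    document language model. Higher (less negative) is better."""
    q_counts = Counter(query_terms)
    q_len = len(query_terms)
    d_counts = Counter(doc_terms)
    d_len = len(doc_terms)
    score = 0.0
    for term, qf in q_counts.items():
        p_wq = qf / q_len  # query language model (maximum likelihood)
        p_wc = collection_freqs.get(term, 0) / collection_size
        # Dirichlet-smoothed document language model
        p_wd = (d_counts.get(term, 0) + mu * p_wc) / (d_len + mu)
        if p_wd > 0:
            score += p_wq * math.log(p_wd)
    return score
```

A scripted similarity only ever sees one `(term, document)` pair at a time, so the sum over query terms above, with the query model weights `p_wq`, is exactly the part that can't be expressed.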
Unless I'm missing something (please tell me if I am; I would really like to be able to use elasticsearch): would it be possible to open up access to the query variable in the Similarity context?
Apologies in advance if it's unclear what I'm asking for. What it comes down to, I suppose, is that Scripted Similarities don't really feel like first-class relevance functions at the moment. Their scope seems to be the individual terms in the term vector, which is pretty restrictive.
The goal of scripted similarities is indeed to allow variations of tf-idf. From the original PR:
The goal of this similarity is to help users who would like to keep the
functionality of the tf-idf similarity that we want to remove, or to allow
for specific use-cases (disabling idf, disabling tf, disabling length norm,
etc.) to not have to build a custom plugin and familiarize with the low-level
Lucene API.
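Concretely, the kind of variation the script has inputs for looks roughly like this. This is a Python mirror of a classic tf-idf script, not Painless; the parameters echo the per-field and per-term statistics the similarity script context exposes (doc count, term doc freq, in-document freq, doc length), so treat the names as an assumption about shape rather than the exact API:

```python
# Hypothetical Python mirror of what a scripted similarity computes for
# one (term, document) pair. Note the inputs: purely local statistics,
# no visibility into the other query terms.
import math

def scripted_tf_idf(boost, field_doc_count, term_doc_freq,
                    doc_freq, doc_length):
    """tf-idf with a length norm, the standard scripted-similarity shape."""
    tf = math.sqrt(doc_freq)  # sublinear term-frequency damping
    idf = math.log((field_doc_count + 1.0) / (term_doc_freq + 1.0)) + 1.0
    norm = 1.0 / math.sqrt(doc_length)  # penalize long documents
    return boost * tf * idf * norm
```

Everything a script like this can do is a re-weighting of those four numbers, which is why it covers tf-idf variants well but not query-level models.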
Unfortunately you would need to write an elasticsearch plugin to achieve your goal. Writing custom scoring like this is a very advanced feature. Long ago (in the 1.x/2.x days) there was limited access to term/position data within scoring scripts, but the implementation had to make a number of assumptions about access patterns and was very inefficient, so it was not practical in reality for advanced scoring cases.
Building a custom Similarity is definitely doable, though, and there are users here that have done it and can help as questions arise.
Alright, that makes sense, I guess: keep the implementation 'simple' so as to guarantee efficiency.
If I may, I would like to make the following (no doubt highly impractical, possibly impracticable) suggestion: introduce an experimental variant, e.g., ScriptedSimilarityDangerWillRobinson, where people like me can prototype more easily using only Painless, in full acceptance, of course, of the associated efficiency penalty. Then, if the prototype can be validated with respect to accuracy, one can move on to committing resources to building an actual custom Similarity.
I'm sure that's a lot to ask though. =p
I'll take a closer look at the custom Similarity plugins I've been bookmarking. *dusts off Java book*