I'm trying to gauge whether Elasticsearch is a good fit for future projects by taking a deep dive into the Painless API. We're an NLP shop with tight deadlines. The prospect of being able to write custom similarity functions without having to create a full-blown Java plugin is pretty interesting.
So, one of the things I'm looking into is whether it's possible to implement a custom language-model-based Similarity. For example: a similarity that defines relevance in terms of the Kullback-Leibler divergence between a document generation model and a query generation model. In order to do this, though, I need access to the terms that occur in the query, which the Painless API currently does not provide in the Similarity context, or in any other context for that matter. In fact, I can't really see how 'Scripted Similarities' (as demonstrated here) allow for much more than writing variations on the TF-IDF weighting scheme.
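To make that concrete, here is a rough sketch of the kind of scoring I have in mind. It is plain Python rather than Painless, and the function name, its parameters, and the choice of Dirichlet smoothing are all just assumptions for illustration, not anything Elasticsearch provides. The point is that the score is a sum over all query terms, weighted by the query model, so the scorer has to see the whole query at once.

```python
import math
from collections import Counter


def kl_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Negative KL divergence between a query model and a smoothed document model.

    Hypothetical helper for illustration only. Higher (closer to 0) is better.
    collection_tf maps term -> total frequency in the collection,
    collection_len is the total number of tokens in the collection,
    mu is the Dirichlet smoothing parameter (an assumed choice of smoothing).
    """
    q_counts = Counter(query_terms)
    q_len = sum(q_counts.values())
    d_counts = Counter(doc_terms)
    d_len = len(doc_terms)

    score = 0.0
    for term, q_tf in q_counts.items():
        # Maximum-likelihood query model: needs the full list of query terms.
        p_q = q_tf / q_len
        # Collection (background) model used for smoothing.
        p_c = collection_tf.get(term, 0) / collection_len
        # Dirichlet-smoothed document model.
        p_d = (d_counts[term] + mu * p_c) / (d_len + mu)
        if p_d == 0.0:
            # Term is absent from both document and collection; skip it
            # rather than take log of infinity.
            continue
        score -= p_q * math.log(p_q / p_d)
    return score


if __name__ == "__main__":
    collection_tf = {"kl": 2, "divergence": 2, "cat": 3, "dog": 3}
    query = ["kl", "divergence"]
    print(kl_score(query, ["kl", "divergence", "cat"], collection_tf, 10))
    print(kl_score(query, ["cat", "dog", "dog"], collection_tf, 10))
```

As far as I can tell, nothing in a per-term similarity script gives me `p_q`, because that quantity depends on every term in the query and not just the one currently being scored.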
Unless I'm missing something (please tell me if I am; I would really like to be able to use Elasticsearch), would it be possible to open up access to the query variable in the Similarity context?
Apologies in advance if it's unclear what I'm asking for. What it comes down to, I suppose, is that Scripted Similarities don't really feel like first-class relevance functions at the moment. Their scope seems to be the individual terms in the term vector, which is pretty restrictive.