Hey,
This thread has expanded quite a bit beyond what was originally asked, so I will simply explain the thought process we go through in ES itself. For us, the decision is quite simple, to be honest: our goal is to focus less on letting people plug in custom (Java) implementations for specific features, and more on enabling similar capabilities for all users through other means (i.e. custom logic). A good example is the custom_score query. Sure, one can plug in a custom Lucene Query implementation and implement any custom scoring needed, but we prefer the custom_score route, where we actually empower and enable all users to take advantage of it.
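To make the custom_score route concrete, here is a minimal sketch of what such a request body might look like, built as a Python dict. It assumes the custom_score query DSL as it existed around the time of this thread (a wrapped `query` plus a `script`); the field names `title` and `popularity` and the helper name `custom_score_request` are made up for illustration.

```python
import json


def custom_score_request(text: str) -> dict:
    """Build a search body that wraps a match query in custom_score.

    The fields "title" and "popularity" are hypothetical; substitute
    fields that exist in your own mapping.
    """
    return {
        "query": {
            "custom_score": {
                # The wrapped query decides which documents match.
                "query": {"match": {"title": text}},
                # The script runs once per matching document; _score is
                # the score produced by the wrapped query.
                "script": "_score * doc['popularity'].value",
            }
        }
    }


# Print the JSON that would be sent as the search request body.
print(json.dumps(custom_score_request("rescoring"), indent=2))
```

The point of the sketch is that the scoring logic lives in the request itself, so no Java plugin has to be deployed to the cluster.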
Regarding rescore, it's a new feature. The first thing we need is to start fleshing out the additional requirements for it, find a way to enable them for all users (btw, the query rescorer covers quite a wide range of those), and have them provided as built-in options. Because the feature is so new, I don't see value in working hard to make its implementation pluggable (internal APIs need to be fleshed out, …); I much prefer to work harder on enabling different usage patterns that can be used by all users.
Regarding generic work on documents across all matches of a query, that typically falls under the facets case, but it really depends on the use case. I do see a place where people will just want to write complete custom logic for both the scatter part and the reduce part, and we need to enable that. Obviously, the nature of the custom logic differs, but if it's aggregations, facets are where it fits.
Last, we do allow for custom implementations in many places, typically driven by where we feel comfortable enabling it (which comes down to the level of confidence we have in the internal APIs, not the external ones). For example, we allow plugging in custom Lucene constructs relatively easily.
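For readers who have not used the query rescorer yet, a rough sketch of a rescore request body follows, again as a Python dict. It assumes the rescore DSL as documented for the feature (`window_size`, `rescore_query`, `query_weight`, `rescore_query_weight`); the field name `title`, the weights, and the helper name `rescore_request` are illustrative assumptions, not recommendations.

```python
import json


def rescore_request(text: str, window_size: int = 50) -> dict:
    """Build a search body that rescores the top hits with a phrase query.

    The field "title" and the weight values are hypothetical.
    """
    return {
        # Cheap first-pass query that runs over all documents.
        "query": {"match": {"title": text}},
        "rescore": {
            # Number of top hits per shard handed to the rescorer.
            "window_size": window_size,
            "query": {
                # More expensive query applied only inside the window.
                "rescore_query": {
                    "match_phrase": {"title": {"query": text, "slop": 2}}
                },
                # Weights applied to the original and rescore scores.
                "query_weight": 0.7,
                "rescore_query_weight": 1.2,
            },
        },
    }


print(json.dumps(rescore_request("the quick brown"), indent=2))
```

Different usage patterns then become a matter of changing the rescore query and weights in the request rather than swapping in a Java implementation.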
On Apr 14, 2013, at 12:50 PM, George Stathis gstathis@gmail.com wrote:
I was chatting with Simeon about this offline but I might as well add my comment here. I think the idea about idempotence is a good one. Unless there is a way to pass custom data around shards, that's pretty much what needs to happen at first. I found that out the hard way trying to work on SOLR-2072 a while back and being stopped in my tracks by the networking layer. The interfaces just didn't support passing around new fields and custom data. It would be pretty much the same case here. Unless TopDoc and SearchDoc are wrapped, there is no way to get more custom data passed over the wire. The other comment that I made offline to Simeon is that to do what he describes (have access to the entire result set), the pluggable layer IMO probably needs to be in the org.elasticsearch.common.lucene package in the form of custom collectors.
On Saturday, April 13, 2013 1:02:21 PM UTC-4, Simeon Simeonov wrote:
A friend from the ES community pointed out that it's not clear whether what I write about must happen across shards or not.
Sure, a cross-shard solution that can also plug into the aggregation node should be the long-term objective but there is a lot of value in making the current per-shard rescore step pluggable without modifying the aggregation. That's true for two reasons:
- Many problems may only require custom processing at the shard level.
- Even problems that require custom processing at both the shard and aggregation level would benefit from the processing distribution and data locality of sharding.
The only problems that will not benefit are the ones that must be solved at the aggregation level. This is the minority of problems.
The analogy here is map/reduce processing. The reduce operation should be idempotent. If a hook at the aggregation node is not available, the final reduce step can be performed on the client, provided, of course, that the right data is delivered to the client via custom fields or whatever. If a hook is available, it can be performed on ES.
The benefits begin to be unlocked with a custom step on the shards, though. The current abstract API that works on TopDocs is a fine start.
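To illustrate the split I have in mind, here is a toy sketch in plain Python. It does not use any ES or Lucene API: `Hit`, `demote_repeated_sources`, `rescore_on_shard` and `reduce_top_n` are all hypothetical names, and the `source` field is invented for the example. The rescorer sees a shard's whole hit list at once, and the reduce step only merges by score, so it could run on the aggregation node or on the client.

```python
import heapq
from itertools import chain
from typing import Callable, Iterable, List, NamedTuple


class Hit(NamedTuple):
    doc_id: str
    score: float
    source: str  # hypothetical field, used only by the example rescorer


Rescorer = Callable[[List[Hit]], List[Hit]]


def demote_repeated_sources(hits: List[Hit]) -> List[Hit]:
    """Set-level rescorer: halve the score of any hit whose source already
    appeared higher in this shard's ranking. It needs the whole list."""
    seen = set()
    out = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.source in seen:
            hit = hit._replace(score=hit.score * 0.5)
        seen.add(hit.source)
        out.append(hit)
    return out


def rescore_on_shard(hits: List[Hit], rescorer: Rescorer) -> List[Hit]:
    """Shard-level step: run the pluggable rescorer, then re-sort by score."""
    return sorted(rescorer(hits), key=lambda h: h.score, reverse=True)


def reduce_top_n(per_shard: Iterable[List[Hit]], size: int) -> List[Hit]:
    """Reduce step: merge the per-shard lists and keep the best `size` hits."""
    return heapq.nlargest(size, chain.from_iterable(per_shard),
                          key=lambda h: h.score)


shard_a = [Hit("a1", 3.0, "blog"), Hit("a2", 2.8, "blog"), Hit("a3", 1.0, "wiki")]
shard_b = [Hit("b1", 2.0, "news")]
per_shard = [rescore_on_shard(s, demote_repeated_sources) for s in (shard_a, shard_b)]
print([h.doc_id for h in reduce_top_n(per_shard, 3)])
```

Note that repeated sources spread across different shards are not caught by the shard-level step alone; that is the kind of problem that also needs a hook at the aggregation level.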
On Saturday, April 13, 2013 12:02:18 AM UTC-4, Simeon Simeonov wrote:
This is a really cool idea. I can see so many uses for this--some in development and some in production--including TDD/debugging/reporting/analytics, before we even get to manipulating the returned results.
The pattern here is no different--in the abstract sense--than that of a stored procedure. Before we say that search engines and databases are different, let's focus on the fact that they both provide high-performance, runtime customizable data services. The same patterns of data generation & use repeat themselves regardless of the specifics of the system. It doesn't matter if it is storage (block or file), databases (SQL or NoSql), integration (messaging or Web services), CMS or search. I've seen this across half a dozen servers my companies have built over the years that have been used by hundreds of thousands of developers in thousands of companies.
The root cause is that there are some operations that should be performed close to where the data is and that need to look at an entire result set as opposed to one result at a time. If these operations cannot be done close to the data (on the ES cluster, in each shard, etc.), then all the data needs to be shipped out on the wire to the client, which can be very expensive. That's the reason behind stored procedures, on-storage computing, the scriptability of NoSql stores such as Redis, MongoDB & CouchDB and even the custom queries and calculated fields in ES. Only the most specialized key-value stores, e.g., Cassandra & HBase, don't offer this.
One of the very attractive things about ES is its scripting extensibility. After a quick look at the docs and the code, I found it strange that there is no extensibility point that allows third-party code to operate on the entire query result set. Perhaps a more flexible rescoring model can help with that? Unfortunately, right now rescoring seems to be hard-coded. It's not what the docs seem to imply, namely that the architecture allows it and other rescoring models just aren't written yet. That type of hard-coded dependency feels a bit un-ES-like...
To me, the question is not what other types of rescoring should be implemented, which would be like asking what other types of queries should be implemented in ES. How do we answer this question given that ES is used by so many different people in so many different ways? A better question to ask might be how to make ES follow the patterns of successful, high-performance servers and allow for an extension point that operates on the entire result set. It is called rescore now but I see it as a more general transformation step, of which rescoring is a common use case, and of which the current rescoring implementation is the one that made the best sense to build first.
If that were available, the ES community would have a way to develop and share rescoring/transformation modules in an easy way. That would benefit everyone and would help ES grow faster. Without this capability, one of two things will happen. Either these data-demanding operations will be performed on the client or developers will be forced to fork the ES codebase to fix the currently hard-coded approach. In the former case, nothing usable could be shared with the community. In the latter case, as with the current hard-coded implementation, nobody will have the incentive to do it well and so there will be no useful pull request contributions. So, the ultimate issue here is as much about technology as it is about open-source community management.
Simon, assuming one wanted to make rescoring scriptable, how should one approach adding this to ES?
On Friday, April 12, 2013 3:01:16 AM UTC-4, simonw wrote:
On Friday, April 12, 2013 12:04:18 AM UTC+2, Otis Gospodnetic wrote:
Hi,
How does one plug in a custom Rescorer into Elasticsearch?
This is from Simon's writeup on query rescorer:
"
Currently the rescore API has only one implementation (the query rescorer) which modifies the result set in-place. Future developments could include dedicated rescore results if needed by the implementation, i.e. a pair-wise reranker.
"
Sounds like alternative implementations should be pluggable, and it does look like there are a number of abstract classes and interfaces to allow alternative implementations. I am just not sure if there is a standard way to tell ES about my alternative rescorer... is there?
Not yet. Do you have any alternative in mind? Can you share your thoughts on this?
simon
Thanks,
Otis
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.