How to plug in an alternative Rescorer?

Hi,

How does one plug in a custom Rescorer into ElasticSearch?
This is from Simon's writeup on query rescorer:

"
Currently the rescore API has only one implementation (the query
rescorer) which modifies the result set in-place. Future developments could
include dedicated rescore results if needed by the implementation, i.e. a
pair-wise reranker.
"

Sounds like alternative implementations should be pluggable, and it does
look like there are a number of abstract classes and interfaces to allow
alternative implementations. I am just not sure if there is a standard way
to tell ES about my alternative rescorer... is there?

Thanks,
Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

On Friday, April 12, 2013 12:04:18 AM UTC+2, Otis Gospodnetic wrote:

Sounds like alternative implementations should be pluggable, and it does
look like there are a number of abstract classes and interfaces to allow
alternative implementations. I am just not sure if there is a standard way
to tell ES about my alternative rescorer... is there?

Not yet. Do you have any alternative in mind? Can you share your thoughts
on this?

simon

Hi Simon,

The idea is to use the Rescorer step as a way to modify matching documents,
by taking advantage of the TopDocs collection in the rescore method.

We would like to answer two types of queries:
e.g. 1: there is a field called "value", and we want to return only the
max-valued documents per document "group", group being another field.
e.g. 2: create a field synthetically, like ScriptField does, with a value
calculated by looking at all the documents in TopDocs.

We thought it made sense to do this as a Rescorer extension, because it
provides a way to summarize and aggregate information in each shard, so
there is less overhead over the wire. If you can think of a cleaner way to
hook in these ideas, please let me know, even if it doesn't involve Rescorer.
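
A minimal sketch of the two examples above, written in plain Python rather than against the ES or Lucene API. The hit dictionaries, the "group" and "value" field names, the "share" output field and both function names are assumptions made for illustration; a real extension would operate on the shard's TopDocs.

```python
def max_per_group(hits, group_field="group", value_field="value"):
    """Example 1: keep only the hits whose value is the maximum in their group."""
    best = {}
    for hit in hits:
        g = hit[group_field]
        best[g] = max(best.get(g, hit[value_field]), hit[value_field])
    return [h for h in hits if h[value_field] == best[h[group_field]]]


def add_share_of_total(hits, value_field="value", out_field="share"):
    """Example 2: add a synthetic field that depends on every hit in the set."""
    total = sum(h[value_field] for h in hits)
    # Copy each hit so the input list is left untouched.
    return [
        dict(h, **{out_field: (h[value_field] / total if total else 0.0)})
        for h in hits
    ]


# Hypothetical hits from a single shard.
shard_hits = [
    {"id": 1, "group": "a", "value": 3},
    {"id": 2, "group": "a", "value": 7},
    {"id": 3, "group": "b", "value": 5},
]
top = max_per_group(shard_hits)
shares = add_share_of_total(shard_hits)
```

Both functions need the whole per-shard hit list before they can produce any output, which is why a per-document script cannot express them.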

Thanks,
Otis

On Friday, April 12, 2013 3:01:16 AM UTC-4, simonw wrote:

[...]

This is a really cool idea. I can see so many uses for this--some in
development and some in production--including
TDD/debugging/reporting/analytics, before we even get to manipulating the
returned results.

The pattern here is no different--in the abstract sense--than that of a
stored procedure. Before we say that search engines and databases are
different, let's focus on the fact that they both provide high-performance,
runtime customizable data services. The same patterns of data generation &
use repeat themselves regardless of the specifics of the system. It doesn't
matter if it is storage (block or file), databases (SQL or NoSQL),
integration (messaging or Web services), CMS or search. I've seen this
across half a dozen servers my companies have built over the years that
have been used by hundreds of thousands of developers in thousands of
companies.

The root cause is that there are some operations that should be performed
close to where the data is and that need to look at an entire result set
as opposed to one result at a time. If these operations cannot be done close
to the data (on the ES cluster, in each shard, etc.), then all the data
needs to be shipped out on the wire to the client, which can be very
expensive. That's the reason behind stored procedures, on-storage
computing, the scriptability of NoSQL stores such as Redis, MongoDB &
CouchDB and even the custom queries and calculated fields in ES. Only the
most specialized key-value stores, e.g., Cassandra & HBase, don't offer
this.

One of the very attractive things about ES is its scripting extensibility.
After a quick look at the docs and the code, I've found it strange that
there is no extensibility point that allows third-party code to operate on
the entire query result set. Perhaps a more flexible rescoring model can
help with that? Unfortunately, right now rescoring seems to be hard-coded.
Contrary to what the docs seem to imply, it is not that the architecture
allows it and other rescoring models simply aren't written yet. That type
of hard-coded dependency feels a bit un-ES-like...

To me, the question is not what other types of rescoring should be
implemented, which would be like asking what other types of queries should
be implemented in ES. How do we answer this question given that ES is used
by so many different people in so many different ways? A better question to
ask might be how to make ES follow the patterns of successful,
high-performance servers and allow for an extension point that operates on
the entire result set. It is called rescore now, but I see it as a more
general transformation step, of which rescoring is a common use case and
of which the current rescoring implementation is the one that made the most
sense to build first.

If that were available, the ES community would have a way to develop and
share rescoring/transformation modules in an easy way. That would benefit
everyone and would help ES grow faster. Without this capability, one of two
things will happen. Either these data-demanding operations will be
performed on the client or developers will be forced to fork the ES
codebase to fix the currently hard-coded approach. In the former case,
nothing usable could be shared with the community. In the latter case, as
with the current hard-coded implementation, nobody will have the incentive
to do it well and so there will be no useful pull request contributions.
So, the ultimate issue here is as much about technology as it is about
open-source community management.

Simon, assuming one wanted to make rescoring scriptable, how should one
approach adding this to ES?

A friend from the ES community pointed out that it's not clear whether what
I write about must happen across shards or not.

Sure, a cross-shard solution that can also plug into the aggregation node
should be the long-term objective but there is a lot of value in making the
current per-shard rescore step pluggable without modifying the
aggregation. That's true for two reasons:

  1. Many problems may only require custom processing at the shard level.

  2. Even problems that require custom processing at both the shard and
    aggregation level would benefit from the processing distribution and data
    locality of sharding.

The only problems that will not benefit are the ones that must be solved at
the aggregation level. These are a minority of problems.

The analogy here is map/reduce processing. The reduce operation should be
idempotent (http://en.wikipedia.org/wiki/Idempotence). If a hook at the
aggregation node is not available, the final reduce step can be performed
on the client--provided, of course, that the right data is shipped to the
client via custom fields or whatever. If a hook is available, it can be
performed on ES.

The benefits begin to be unlocked with a custom step on the shards, though.
The current abstract API that works on TopDocs is a fine start.
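
The map/reduce analogy can be sketched as follows, assuming the "maximum value per group" problem from earlier in the thread. The shard data and the function name are made up for illustration. The same reduce runs on each shard and again on the combined partial results, and running it once more on its own output changes nothing.

```python
def reduce_max_per_group(pairs):
    """Reduce (group, value) pairs to the maximum value per group."""
    best = {}
    for group, value in pairs:
        if group not in best or value > best[group]:
            best[group] = value
    return sorted(best.items())


# Hypothetical matches on two shards.
shard_1 = [("a", 3), ("a", 7), ("b", 5)]
shard_2 = [("a", 9), ("c", 1)]

# Per-shard step: each shard ships at most one pair per group.
partial_1 = reduce_max_per_group(shard_1)
partial_2 = reduce_max_per_group(shard_2)

# Final step, on the aggregation node or on the client: the same function
# applied to the concatenated partial results.
final = reduce_max_per_group(partial_1 + partial_2)
```

Here the client-side reduce sees three pairs from the first shard's two groups and the second shard's two groups instead of all five raw matches; with realistic shard sizes the saving on the wire is much larger.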

On Saturday, April 13, 2013 12:02:18 AM UTC-4, Simeon Simeonov wrote:

[...]

I was chatting with Simeon about this offline but I might as well add my
comment here. I think the idea about idempotence is a good one. Unless
there is a way to pass custom data around shards, that's pretty much what
needs to happen at first. I found that out the hard way trying to work on
SOLR-2072 a while back and being stopped in my tracks by the networking
layer. The interfaces just didn't support passing around new fields and
custom data. It would be pretty much the same case here. Unless TopDocs and
SearchDoc are wrapped, there is no way to get more custom data passed
over the wire. The other comment that I made offline to Simeon is that to
do what he describes (have access to the entire result set) the pluggable
layer IMO probably needs to be in the org.elasticsearch.common.lucene
package in the form of custom collectors.
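
To illustrate why collectors come up here: a collector is called once for every matching document, whereas a rescorer only sees the top-N window. The sketch below is a plain-Python stand-in with a simplified collect(doc_id, score) signature, which is an assumption and not Lucene's actual Collector interface; both class names are hypothetical.

```python
import heapq


class TopNCollector:
    """Keeps only the n best-scoring docs, like the window a rescorer sees."""

    def __init__(self, n):
        self.n = n
        self.heap = []  # min-heap of (score, doc_id)

    def collect(self, doc_id, score):
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (score, doc_id))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, doc_id))

    def top_docs(self):
        return [doc_id for _, doc_id in sorted(self.heap, reverse=True)]


class StatsCollector:
    """Sees every match and keeps a running summary of the full result set."""

    def __init__(self):
        self.count = 0
        self.score_sum = 0.0

    def collect(self, doc_id, score):
        self.count += 1
        self.score_sum += score


# Hypothetical (doc_id, score) matches on one shard.
matches = [(10, 0.2), (11, 0.9), (12, 0.5), (13, 0.7)]
top, stats = TopNCollector(2), StatsCollector()
for doc_id, score in matches:
    top.collect(doc_id, score)
    stats.collect(doc_id, score)
```

The top-N collector ends up with two documents, while the stats collector has counted all four matches; only the latter kind of hook has access to the entire result set.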

On Saturday, April 13, 2013 1:02:21 PM UTC-4, Simeon Simeonov wrote:

[...]

Hey,

This thread has expanded quite a bit beyond what was originally asked. I will simply explain the thought process that we go through in ES itself. For us, the decision is quite simple, to be honest: our goal is to focus less on being able to plug in custom (Java) implementations for specific features, and instead to enable similar capabilities for all users through other means (i.e. custom logic). A good example is the custom_score query. Sure, one can plug in a custom Lucene Query implementation and implement any custom scoring needed, but we prefer the custom_score route, where we actually empower and enable all users to take advantage of it.
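
For readers who have not used it, a custom_score request body looks roughly like the sketch below, built here as a Python dictionary. The structure follows the 0.90-era query DSL as best understood; the "title" and "value" fields, the "weight" parameter and the script expression are made up for illustration.

```python
import json

# Wrap an ordinary query and replace its score with a scripted expression.
custom_score_request = {
    "query": {
        "custom_score": {
            "query": {"match": {"title": "rescorer"}},
            "params": {"weight": 2},
            "script": "_score * doc['value'].value * weight",
        }
    }
}

# Serialized form, as it would be sent in the body of a search request.
body = json.dumps(custom_score_request, indent=2)
```

The script runs once per matching document, so it can change each document's score but cannot look at the other hits.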

Regarding rescore, it's a new feature. The first thing we need is to start to flesh out all the additional requirements for it, find a way to enable them for all users (btw, the query rescorer covers quite a wide range of those), and have those provided as built-in options. Because the feature is so new, I don't see value in trying to work hard at making its implementation pluggable (internal APIs need to be fleshed out, …); I much prefer to work harder on enabling different usage patterns that can be used by all users.
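
As a reference point for what the built-in query rescorer covers, a request using it looks roughly like this sketch, again built as a Python dictionary. The layout follows the 0.90-era rescore documentation as best understood; the field name, the queries and the weights are assumptions for illustration.

```python
import json

rescore_request = {
    # Cheap first-pass query, run against all documents on each shard.
    "query": {"match": {"title": "custom rescorer"}},
    "rescore": {
        # Only the top window_size hits per shard are rescored.
        "window_size": 50,
        "query": {
            # More expensive second-pass query applied to that window.
            "rescore_query": {"match_phrase": {"title": "custom rescorer"}},
            "query_weight": 0.7,
            "rescore_query_weight": 1.2,
        },
    },
}

body = json.dumps(rescore_request, indent=2)
```

The final score of a rescored hit combines the original score and the rescore query's score using the two weights; hits outside the window keep their original score.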

Regarding generic work on documents across all matches of a query, those typically fall under the facets case, but it really depends on the use case. I do see a place where people will just want to write completely custom logic for both the scatter part and the reduce part, and we need to enable that. Obviously, the nature of the custom logic differs, but if it's aggregations, facets are where it fits.
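
The scatter and reduce parts mentioned here can be sketched in plain Python for a statistics-style aggregation. This is not the ES facet API: the function names and sample values are hypothetical, and each shard is assumed to have at least one matching value.

```python
def shard_stats(values):
    """Scatter part: summarise the matching values on one shard."""
    return {
        "count": len(values),
        "total": sum(values),
        "min": min(values),
        "max": max(values),
    }


def merge_stats(parts):
    """Reduce part: combine per-shard summaries on the coordinating node."""
    merged = {
        "count": sum(p["count"] for p in parts),
        "total": sum(p["total"] for p in parts),
        "min": min(p["min"] for p in parts),
        "max": max(p["max"] for p in parts),
    }
    # The mean is derived only after merging; averaging per-shard means
    # would be wrong when shards hold different numbers of matches.
    merged["mean"] = merged["total"] / merged["count"]
    return merged


# Hypothetical matching values on two shards.
parts = [shard_stats([3, 7, 5]), shard_stats([9, 1])]
overall = merge_stats(parts)
```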

Last, we do allow for custom implementations in many places, typically driven by where we feel comfortable enabling it (a combination of the level of confidence we have in the internal APIs, not the external ones). For example, we allow plugging in custom Lucene constructs relatively easily.

On Apr 14, 2013, at 12:50 PM, George Stathis gstathis@gmail.com wrote:

[...]

Shay, thanks for sharing your design objectives. Since I'm new to ES, can
you point me to the areas in the faceting system where you think a general
transformation step could plug in, if/when it makes sense to add one?

Thanks,
Sim

On Monday, April 15, 2013 2:53:12 PM UTC-4, kimchy wrote:

Hey,

This thread has expanded quite a bit beyond what was originally asked.
I will simply explain the thought process that we go through in ES itself.
For us, the decision is quite simple, to be honest: our goal is to focus
less on being able to plug in custom (Java) implementations for specific
features, and instead to enable similar capabilities for all users through
other means (i.e. custom logic). A good example is the custom_score query:
sure, one can plug in a custom Lucene Query implementation and implement
any custom scoring needed, but we prefer the custom_score route, where we
actually empower and enable all users to take advantage of it.

Regarding rescore, it's a new feature. The first thing we need is to
start to flesh out all the additional requirements for it, find a way to
enable all users (btw, the query rescorer covers quite a wide range of
those), and have those provided as built-in options. Because the feature
is so new, I don't see value in trying to work hard on making its
implementation pluggable (internal APIs need to be fleshed out, …); I much
prefer to work harder on enabling different usage patterns that can be used
by all users.

Regarding generic work on documents across all matches of a query,
those typically fall under the facets case, but it really depends on the
use case. I do see a place where people will just want to write complete
custom logic for both the scatter part and the reduce part, and we need to
enable that. Obviously, the nature of the custom logic differs, but if
it's aggregations, facet is where it fits.

Last, we do allow for custom implementations in many places, typically
driven by where we feel comfortable enabling it (which comes down to the
level of confidence we have in the internal APIs, not the external
ones). For example, we allow plugging in custom Lucene constructs
relatively easily.

On Apr 14, 2013, at 12:50 PM, George Stathis <gsta...@gmail.com> wrote:

I was chatting with Simeon about this offline but I might as well add my
comment here. I think the idea about idempotence is a good one. Unless
there is a way to pass custom data around shards, that's pretty much what
needs to happen at first. I found that out the hard way trying to work on
SOLR-2072 a while back and being stopped in my tracks by the networking
layer. The interfaces just didn't support passing around new fields and
custom data. It would be pretty much the same case here. Unless TopDoc and
SearchDoc are wrapped, there is no way to get more custom data passed
over the wire. The other comment that I made offline to Simeon is that to
do what he describes (have access to the entire result set), the pluggable
layer IMO probably needs to be in the org.elasticsearch.common.lucene
package in the form of custom collectors.

On Saturday, April 13, 2013 1:02:21 PM UTC-4, Simeon Simeonov wrote:

A friend from the ES community pointed out that it's not clear whether
what I write about must happen across shards or not.

Sure, a cross-shard solution that can also plug into the aggregation node
should be the long-term objective but there is a lot of value in making the
current per-shard rescore step pluggable without modifying the
aggregation. That's true for two reasons:

  1. Many problems may only require custom processing at the shard level

  2. Even problems that require custom processing at both the shard and
    aggregation level would benefit from the processing distribution and data
    locality of sharding.

The only problems that will not benefit are the ones that must be solved
at the aggregation level. This is the minority of problems.

The analogy here is map/reduce processing. The reduce operation should be
idempotent (http://en.wikipedia.org/wiki/Idempotence). If a hook at the
aggregation node is not available, the final reduce step can be performed
on the client, provided the right data is shipped to the client via custom
fields or whatever. If a hook is available, it can be performed on ES.

The benefits begin to be unlocked with a custom step on the shards,
though. The current abstract API that works on TopDocs is a fine start.
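
To make the shard/reduce split above concrete, here is a minimal Java sketch. Everything in it is an assumption made for illustration: Hit, shardStep and reduce are hypothetical names, not Elasticsearch or Lucene classes, and the per-shard custom logic is just a function over the shard's whole hit list. The reduce is a plain merge-and-truncate, so applying it again to its own output changes nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these types are Elasticsearch or Lucene classes.
public class ShardReduceSketch {

    // Minimal stand-in for one scored hit.
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Sort best-first and keep at most `size` hits.
    static List<Hit> topN(List<Hit> hits, int size) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Double.compare(b.score, a.score));
        return new ArrayList<>(sorted.subList(0, Math.min(size, sorted.size())));
    }

    // Shard-level step: the custom logic sees the shard's whole candidate list.
    static List<Hit> shardStep(List<Hit> shardHits, UnaryOperator<List<Hit>> custom, int size) {
        return topN(custom.apply(shardHits), size);
    }

    // Reduce step: merge the per-shard lists. The same code could run on the
    // aggregation node if a hook existed there, or on the client if not.
    static List<Hit> reduce(List<List<Hit>> perShard, int size) {
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            merged.addAll(hits);
        }
        return topN(merged, size);
    }

    public static void main(String[] args) {
        // Example custom logic (an assumption): drop hits scoring below 1.0.
        UnaryOperator<List<Hit>> dropWeak = hits -> {
            List<Hit> kept = new ArrayList<>();
            for (Hit h : hits) {
                if (h.score >= 1.0) {
                    kept.add(h);
                }
            }
            return kept;
        };

        List<Hit> shard1 = shardStep(Arrays.asList(new Hit("a", 3.0), new Hit("b", 0.5)), dropWeak, 2);
        List<Hit> shard2 = shardStep(Arrays.asList(new Hit("c", 2.0), new Hit("d", 4.0)), dropWeak, 2);

        for (Hit h : reduce(Arrays.asList(shard1, shard2), 3)) {
            System.out.println(h.id + " " + h.score);
        }
    }
}
```

The point of the split is that only shardStep needs a server-side hook to get the data-locality benefit; reduce works the same wherever it runs.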

On Saturday, April 13, 2013 12:02:18 AM UTC-4, Simeon Simeonov wrote:

This is a really cool idea. I can see so many uses for this--some in
development and some in production--including
TDD/debugging/reporting/analytics, before we even get to manipulating the
returned results.

The pattern here is no different--in the abstract sense--than that of a
stored procedure. Before we say that search engines and databases are
different, let's focus on the fact that they both provide high-performance,
runtime customizable data services. The same patterns of data generation &
use repeat themselves regardless of the specifics of the system. It doesn't
matter if it is storage (block or file), databases (SQL or NoSql),
integration (messaging or Web services), CMS or search. I've seen this
across half a dozen servers my companies have built over the years that
have been used by hundreds of thousands of developers in thousands of
companies.

The root cause is that there are some operations that should be
performed close to where the data is and that need to look at an entire
result set as opposed to one result at a time. If these operations
cannot be done close to the data (on the ES cluster, in each shard, etc.),
then all the data needs to be shipped over the wire to the client, which
can be very expensive. That's the reason behind stored procedures,
on-storage computing, the scriptability of NoSql stores such as Redis,
MongoDB & CouchDB and even the custom queries and calculated fields in ES.
Only the most specialized key-value stores, e.g., Cassandra & HBase, don't
offer this.

One of the very attractive things about ES is its scripting
extensibility. After a quick look at the docs and the code, I found it
strange that there is no extensibility point that allows third-party code
to operate on the entire query result set. Perhaps a more flexible
rescoring model can help with that? Unfortunately, right now rescoring
seems to be hard-coded. It's not what the docs seem to imply: that the
architecture allows it and other rescoring models just aren't written yet.
That type of hard-coded dependency feels a bit un-ES-like...

To me, the question is not what other types of rescoring should be
implemented, which would be like asking what other types of queries should
be implemented in ES. How do we answer this question given that ES is used
by so many different people in so many different ways? A better question to
ask might be how to make ES follow the patterns of successful,
high-performance servers and allow for an extension point that operates on
the entire result set. It is called rescore now but I see it as a more
general transformation step, of which rescoring is a common use case, and
of which the current rescoring implementation is the one that made the most
sense to build first.
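
As a rough illustration of what such an extension point could look like, here is a minimal Java sketch of a registry of named result-set transformations. All names in it (ResultSetTransform, register, run) are hypothetical and chosen for this example; nothing here is an existing Elasticsearch interface, and real hits would carry scores and fields rather than bare ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a named extension point; not an Elasticsearch API.
public class TransformRegistrySketch {

    // A transformation receives the whole result list and returns a new one.
    interface ResultSetTransform {
        List<String> apply(List<String> docIds);
    }

    private final Map<String, ResultSetTransform> byName = new HashMap<>();

    // Third-party code would register its implementation under a name.
    void register(String name, ResultSetTransform transform) {
        byName.put(name, transform);
    }

    // A request would then select a transformation by that name.
    List<String> run(String name, List<String> docIds) {
        ResultSetTransform transform = byName.get(name);
        if (transform == null) {
            throw new IllegalArgumentException("unknown transform: " + name);
        }
        return transform.apply(docIds);
    }

    public static void main(String[] args) {
        TransformRegistrySketch registry = new TransformRegistrySketch();
        // Toy transformation: reverse the order of the result set.
        registry.register("reverse", docIds -> {
            List<String> copy = new ArrayList<>(docIds);
            Collections.reverse(copy);
            return copy;
        });
        System.out.println(registry.run("reverse", Arrays.asList("doc1", "doc2", "doc3")));
    }
}
```

Under this shape, rescoring would be one registered transformation among others rather than the only hard-coded one.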

If that were available, the ES community would have a way to develop and
share rescoring/transformation modules in an easy way. That would benefit
everyone and would help ES grow faster. Without this capability, one of two
things will happen. Either these data-demanding operations will be
performed on the client or developers will be forced to fork the ES
codebase to fix the currently hard-coded approach. In the former case,
nothing usable could be shared with the community. In the latter case, as
with the current hard-coded implementation, nobody will have the incentives
to do it well and so there will be no useful pull request contributions.
So, the ultimate issue here is as much about technology as it is about
open-source community management.

Simon, assuming one wanted to make rescoring scriptable, how should one
approach adding this to ES?

Hi,

If I understand this correctly, the hope is that higher-level functionality
can be exposed in a simpler, less "intrusive" way, and that instead of
satisfying one narrow, specific use case it can be the enabler for N
different user-level features? That seems fine by me, as long as this
higher-level functionality can indeed satisfy things like what Simeon
brought up or what I initially asked about.

"
Regarding generic work on documents across all matches of a query, those
typically fall under the facets case, but it really depends on the use
case. I do see a place where people will just want to write complete custom
logic for both the scatter part and the reduce part, we need to enable
that. Obviously, the nature of the custom logic differs, but if it's
aggregations, facet is where it fits.
"

Shay & Co, are you referring to something like
https://github.com/imotov/elasticsearch-facet-script ?
If not, is there something else in ES itself or elsewhere that we could
model things after that you could point us to?

Thanks,
Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

Hi,

Quoting Shay: "Obviously, the nature of the custom logic differs, but if
it's aggregations, facet is where it fits."
I think facets are not a good fit in this case, as what's needed is not
only aggregating, i.e. "get the max value of a certain field with a group
by restriction", but also manipulating the TopDocs, i.e. "removing the
not-max-valued documents". Some information must also be added to the
winning documents (like ScriptField does).

This specific use case must do something like:

  1. Run an ES query.
  2. In each shard, get the TopDocs and filter out some documents according
    to a "group by" function and a "unique" restriction. Also add some extra
    fields to the winning documents, like ScriptField does.
  3. In the calling node, execute the filtering from step 2 again.

The general use case would be a map/reduce hook that could manipulate the
data locally in the shards, and again later at the reduce step. Some
calculations done in that reduce step would also need to be added to the
returned documents, like ScriptField does.
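
The steps above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: Doc is a hypothetical stand-in for a hit (not a Lucene or ES class), the "unique" restriction is read as "keep the max-valued document per group", and groupSize is an invented extra field showing how a value computed during filtering could travel with the winning document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; Doc is a stand-in, not an Elasticsearch or Lucene type.
public class GroupMaxSketch {

    static final class Doc {
        final String id;
        final String group;   // value of the "group by" field
        final double value;   // field whose maximum wins within a group
        final int groupSize;  // extra field: how many docs this winner stands for

        Doc(String id, String group, double value, int groupSize) {
            this.id = id;
            this.group = group;
            this.value = value;
            this.groupSize = groupSize;
        }
    }

    // Keep only the max-valued doc per group and carry the group size along.
    // Written so the same function can run per shard and again at the reduce.
    static List<Doc> keepMaxPerGroup(List<Doc> docs) {
        Map<String, Doc> winners = new LinkedHashMap<>();
        for (Doc d : docs) {
            Doc current = winners.get(d.group);
            if (current == null) {
                winners.put(d.group, d);
            } else {
                Doc best = d.value > current.value ? d : current;
                winners.put(d.group,
                        new Doc(best.id, best.group, best.value, current.groupSize + d.groupSize));
            }
        }
        return new ArrayList<>(winners.values());
    }

    public static void main(String[] args) {
        // Step 2: each shard filters its own hits.
        List<Doc> shard1 = keepMaxPerGroup(Arrays.asList(
                new Doc("a1", "a", 1.0, 1), new Doc("a2", "a", 3.0, 1), new Doc("b1", "b", 2.0, 1)));
        List<Doc> shard2 = keepMaxPerGroup(Arrays.asList(
                new Doc("a3", "a", 5.0, 1), new Doc("b2", "b", 1.0, 1)));

        // Step 3: the calling node repeats the filtering on the merged winners.
        List<Doc> merged = new ArrayList<>(shard1);
        merged.addAll(shard2);
        for (Doc d : keepMaxPerGroup(merged)) {
            System.out.println(d.id + " group=" + d.group + " groupSize=" + d.groupSize);
        }
    }
}
```

Because the shard step already discards losing documents, only one document per group per shard has to cross the wire, which is the data-locality argument made earlier in the thread.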

Do you think it might be a good addition to the current code base?
Would you be interested in coding it, or in accepting a pull request for
it? For a pull request, could you provide a bit of guidance on implementing
this in a clean way?

Thanks,
Sebastian.

Hi,

I'm interested in what Sebastian is describing here and if he's right about
piggy-backing on faceting not being suitable in the described use-case,
could anyone please suggest an alternate route? Ideally one that might
stand the chance of getting pulled into ES?

Thanks,
Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

On Monday, April 15, 2013 6:45:38 PM UTC-4, Sebastian wrote:

Hi,

Quoting Shay: "Obviously, the nature of the custom logic differs, but
if it's aggregations, facet is where it fits."
I don't think facets are a good fit in this case. What's needed is not
only aggregating, i.e. "get the max value of a certain field with a
group by restriction", but also manipulating the TopDocs, i.e. "removing
the not-max-valued documents". Some information must also be added to the
winning documents (like ScriptField does).

This specific use case must do something like:

  1. run an ES query
  2. in each shard, get the TopDocs and filter out some documents according to a
    "group by" function and a "unique" restriction. Also add some extra fields
    to the winning documents, like ScriptField does.
  3. on the calling node, execute the filtering from step 2 again
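
The three steps above can be sketched in plain Python, outside of ES. This is only an illustration of the data flow, not ES code: the function name `keep_group_winners`, the `group_max` extra field, and the sample `brand`/`price` documents are all hypothetical, and documents are modeled as plain dicts.

```python
from typing import Any, Dict, Iterable, List

Doc = Dict[str, Any]


def keep_group_winners(docs: Iterable[Doc], group_field: str, value_field: str) -> List[Doc]:
    """Keep one document per group: the one with the largest value_field.

    Ties keep the first document seen. Each winner gets an extra
    'group_max' field (hypothetical), loosely in the spirit of a script field.
    """
    winners: Dict[Any, Doc] = {}
    for doc in docs:
        key = doc[group_field]
        best = winners.get(key)
        if best is None or doc[value_field] > best[value_field]:
            winners[key] = doc
    return [dict(doc, group_max=doc[value_field]) for doc in winners.values()]


# Step 2: each shard filters its own top docs locally.
shard_a = [
    {"id": 1, "brand": "acme", "price": 10},
    {"id": 2, "brand": "acme", "price": 30},
    {"id": 3, "brand": "zeta", "price": 5},
]
shard_b = [
    {"id": 4, "brand": "acme", "price": 20},
    {"id": 5, "brand": "zeta", "price": 8},
]
per_shard = [keep_group_winners(s, "brand", "price") for s in (shard_a, shard_b)]

# Step 3: the calling node runs the same filter over the merged shard results.
merged = [doc for shard in per_shard for doc in shard]
final = keep_group_winners(merged, "brand", "price")
print(sorted(doc["id"] for doc in final))
```

Because the same function runs on the shards and on the calling node, only the per-shard winners travel over the wire, not every matching document.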

The general use case would be a map/reduce hook that could manipulate the
data locally in the shards, and again later in the reduce step. Some values
calculated in that reduce step would also need to be added to the returned
documents, like ScriptField does.

Do you think it might be a good addition to the current code base?
Would you be interested in either coding it or accepting a pull request
for this? For a pull request, could you provide a bit of guidance on
implementing this in a clean way?

Thanks,
Sebastian.

On Mon, Apr 15, 2013 at 4:58 PM, Otis Gospodnetic <otis.gos...@gmail.com> wrote:

Hi,

If I understand this correctly, the hope is that higher-level
functionality can be exposed in a simpler, less "intrusive" way and be used
not to satisfy a narrower, more specific use case, but to be the enabler for N
different user-level features? That seems fine by me, as long as this
higher-level functionality can indeed satisfy things like what Simeon
brought up or what I initially asked about.

Regarding generic work on documents across all matches of a query,
those typically fall under the facets case, but it really depends on the
use case. I do see a place where people will just want to write complete
custom logic for both the scatter part and the reduce part, we need to
enable that. Obviously, the nature of the custom logic differs, but if
it's aggregations, facet is where it fits.

Shay & Co, are you referring to something like
https://github.com/imotov/elasticsearch-facet-script ?
If not, is there something else in ES itself or elsewhere that we could
model things after that you could point us to?

Thanks,
Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

On Monday, April 15, 2013 2:53:12 PM UTC-4, kimchy wrote:

Hey,

This thread has expanded quite a bit beyond what was originally
asked. I will simply explain the thought process that we go through in ES
itself. For us, the decision is quite simple, to be honest: our goal is to
focus less on being able to plug in custom (Java) implementations for
specific features, and instead enable similar capabilities for all users
through other means (i.e. custom logic). A good example is the custom_score
query. Sure, one can plug in a custom Lucene Query implementation and
implement any custom scoring needed, but we prefer the custom_score route,
where we actually empower and enable all users to take advantage of it.
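
For readers who have not used it, a custom_score request looks roughly like the body built below. This is a sketch based on the query DSL as it existed around the time of this thread; the field names (`title`, `popularity`) and the `boost` parameter are made up for illustration, so check the documentation for your ES version before relying on it.

```python
import json

# Wrap a regular query and replace its score with the result of a script.
# Field names and the "boost" parameter are hypothetical.
custom_score_body = {
    "query": {
        "custom_score": {
            "query": {"match": {"title": "rescorer"}},
            "params": {"boost": 1.5},
            "script": "_score * boost * doc['popularity'].value",
        }
    }
}

print(json.dumps(custom_score_body, indent=2))
```

The point of the design is that the scoring logic lives in the request, so no Java plugin has to be deployed to the cluster.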

Regarding rescore, it's a new feature. The first thing we need is to
start to flesh out all the additional requirements for it, find a
way to enable all users (btw, the query rescorer covers quite a wide range
of those), and have those provided as built-in options. Because the feature
is so new, I don't see value in working hard on making its
implementation pluggable (internal APIs need to be fleshed out, …). I would much
prefer to work harder on enabling different usage patterns that can be used
by all users.
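
As a reminder of what the built-in query rescorer accepts, here is a sketch of a search body with a rescore section, written as a Python dict. It follows the shape of the rescore API as documented around this time; the `body` field and the query text are placeholders, and the weights are arbitrary, so treat it as an assumption to verify against your ES version.

```python
import json

# A cheap first-pass query, then a more expensive phrase query that is only
# applied to the top `window_size` hits on each shard.
rescore_body = {
    "query": {"match": {"body": {"query": "the quick brown", "operator": "or"}}},
    "rescore": {
        "window_size": 50,
        "query": {
            "rescore_query": {
                "match": {
                    "body": {"query": "the quick brown", "type": "phrase", "slop": 2}
                }
            },
            "query_weight": 0.7,
            "rescore_query_weight": 1.2,
        },
    },
}

print(json.dumps(rescore_body, indent=2))
```

Everything that can vary here is a query plus two weights, which is the "one implementation" the thread is discussing.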

Regarding generic work on documents across all matches of a query,
those typically fall under the facets case, but it really depends on the
use case. I do see a place where people will just want to write complete
custom logic for both the scatter part and the reduce part, we need to
enable that. Obviously, the nature of the custom logic differs, but if
it's aggregations, facet is where it fits.

Last, we do allow for custom implementations in many places,
typically where we feel comfortable enabling it (which depends on
the level of confidence we have in the internal APIs, not the
external ones). For example, we allow plugging in custom Lucene constructs
relatively easily.

On Apr 14, 2013, at 12:50 PM, George Stathis gsta...@gmail.com wrote:

I was chatting with Simeon about this offline but I might as well add my
comment here. I think the idea about idempotence is a good one. Unless
there is a way to pass custom data around shards, that's pretty much what
needs to happen at first. I found that out the hard way trying to work on
SOLR-2072 a while back and being stopped in my tracks by the networking
layer. The interfaces just didn't support passing around new fields and
custom data. It would be pretty much the same case here. Unless TopDoc and
SearchDoc are wrapped, there is no way to get more custom data passed
over the wire. The other comment that I made offline to Simeon is that to
do what he describes (have access to the entire result set) the pluggable
layer IMO probably needs to be in the org.elasticsearch.common.lucene
package in the form of custom collectors.

On Saturday, April 13, 2013 1:02:21 PM UTC-4, Simeon Simeonov wrote:

A friend from the ES community pointed out that it's not clear whether
what I write about must happen across shards or not.

Sure, a cross-shard solution that can also plug into the aggregation
node should be the long-term objective but there is a lot of value in
making the current per-shard rescore step pluggable without modifying
the aggregation. That's true for two reasons:

  1. Many problems may only require custom processing at the shard level

  2. Even problems that require custom processing at both the shard and
    aggregation level would benefit from the processing distribution and data
    locality of sharding.

The only problems that will not benefit are the ones that must be
solved at the aggregation level. This is the minority of problems.

The analogy here is map/reduce processing. The reduce operation should
be idempotent (http://en.wikipedia.org/wiki/Idempotence). If a hook at
the aggregation node is not available, the final reduce step can be
performed on the client, provided the right data is provisioned to the
client via custom fields or whatever. If a hook is available, it can be
performed on ES.
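
The map/reduce analogy above can be made concrete with a small sketch. It is not ES code: `top_k` is a hypothetical reduce step over (doc id, score) pairs, used only to show why a reduce with this property can run on the shards, again on the aggregation node if a hook exists, or on the client otherwise, and still produce the same answer.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (doc id, score)


def top_k(hits: List[Hit], k: int) -> List[Hit]:
    """Reduce step: keep the k best hits, highest score first, ties by id."""
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))[:k]


shard_hits = [
    [("a", 0.9), ("b", 0.4), ("c", 0.1)],
    [("d", 0.8), ("e", 0.7)],
    [("f", 0.3)],
]
k = 3

# Run the reduce on every shard first, then once more over the merged partials.
partials = [top_k(hits, k) for hits in shard_hits]
merged = top_k([hit for partial in partials for hit in partial], k)

# Same answer as reducing everything in one place.
everything = [hit for hits in shard_hits for hit in hits]
assert merged == top_k(everything, k)

# Applying the reduce again changes nothing, so it does not matter whether
# the final pass runs on the aggregation node, on the client, or on both.
assert top_k(merged, k) == merged
print(merged)
```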

The benefits begin to be unlocked with a custom step on the shards,
though. The current abstract API that works on TopDocs is a fine start.

On Saturday, April 13, 2013 12:02:18 AM UTC-4, Simeon Simeonov wrote:

This is a really cool idea. I can see so many uses for this--some in
development and some in production--including TDD/debugging/reporting/analytics,
before we even get to manipulating the returned results.

The pattern here is no different--in the abstract sense--than that of
a stored procedure. Before we say that search engines and databases are
different, let's focus on the fact that they both provide high-performance,
runtime customizable data services. The same patterns of data generation &
use repeat themselves regardless of the specifics of the system. It doesn't
matter if it is storage (block or file), databases (SQL or NoSql),
integration (messaging or Web services), CMS or search. I've seen this
across half a dozen servers my companies have built over the years that
have been used by hundreds of thousands of developers in thousands of
companies.

The root cause is that there are some operations that should be
performed close to where the data is and that need to look at an entire
result set as opposed to one result at a time. If these operations
cannot be done close to the data (on the ES cluster, in each shard, etc.),
then all the data needs to be shipped out over the wire to the client, which
can be very expensive. That's the reason behind stored procedures,
on-storage computing, the scriptability of NoSql stores such as Redis,
MongoDB & CouchDB and even the custom queries and calculated fields in ES.
Only the most specialized key-value stores, e.g., Cassandra & HBase, don't
offer this.

One of the very attractive things about ES is its scripting
extensibility. After a quick look at the docs and the code, I've found it
strange that there is no extensibility point that allows third party code
to operate on the entire query result set. Perhaps a more flexible
rescoring model can help with that? Unfortunately, right now
rescoring seems to be hard-coded. It's not like what the docs seem to
imply: that the architecture allows it and other rescoring models aren't
written yet. That type of hard-coded dependency feels a bit un-ES like...

To me, the question is not what other types of rescoring should be
implemented, which would be like asking what other types of queries should
be implemented in ES. How do we answer this question given that ES is used
by so many different people in so many different ways? A better question to
ask might be how to make ES follow the patterns of successful,
high-performance servers and allow for an extension point that operates on
the entire result set. It is called rescore now but I see it as a more
general transformation step, of which rescoring is a common use case and
of which the current rescoring implementation is the one that made the
most sense to build first.

If that were available, the ES community would have a way to develop
and share rescoring/transformation modules in an easy way. That would
benefit everyone and would help ES grow faster. Without this capability,
one of two things will happen. Either these data-demanding operations will
be performed on the client or developers will be forced to fork the ES
codebase to fix the currently hard-coded approach. In the former case,
nothing usable could be shared with the community. In the latter case, as
with the current hard-coded implementation, nobody will have the incentives
to do it well and so there will be no useful pull request contributions.
So, the ultimate issue here is as much about technology as it is about
open-source community management.

Simon, assuming one wanted to make rescoring scriptable, how should
one approach adding this to ES?

