Complex custom relevance calculations and huge result sets

I am new to Elasticsearch and I am analysing whether it would be feasible
for us to build our search solution on top of it. One of our challenges is
that our relevance calculations are quite complex, query time is critical,
and we want to prevent the calculations from getting out of control with
huge result sets. We are therefore interested in reducing the result set as
much as possible before doing the more complex calculations, and I am
looking for advice on how this could be solved using Elasticsearch.

We believe that the tf/idf relevance model will work well as the initial
filter, where we are looking to cut off the most irrelevant items at the
first significant relevance drop. At the next stage we would have our first
custom filter, where we calculate additional scores from the query input
based on preprocessed information about the items. Again, we would like to
keep only the most relevant items and cut off at the first significant
relevance drop if there are still too many items in the set. Finally, we
would like to apply the most complex relevance calculations to the
remaining result set.
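
To make this concrete, here is a rough sketch of the kind of request I
imagine: a fixed min_score cutoff on the initial tf/idf query, followed by
a script-based rescore of the top hits for the second stage. This is only a
sketch and nothing I have tested; it assumes the rescore API and
script-based scoring are available in the version we would use, and the
field name "quality_signal", the threshold and the weights are made up for
illustration.

curl -XPOST 'localhost:9200/items/_search?pretty' -d '{
  "min_score": 0.5,
  "query": {
    "match": { "text": "example user query" }
  },
  "rescore": {
    "window_size": 200,
    "query": {
      "rescore_query": {
        "custom_score": {
          "query": { "match_all": {} },
          "script": "doc[\"quality_signal\"].value"
        }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 2.0
    }
  }
}'

As I understand it, min_score would drop the weakest tf/idf matches and
window_size would limit the more expensive script scoring to the top 200
hits per shard; the third, most complex stage could then run on whatever
remains, either as another rescore or outside Elasticsearch.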

The thinking is that if we did all this within the Elasticsearch processing
framework, we could harness its distributed qualities to easily shard and
parallelise search processing. It would also give us a framework for
experimenting with different relevance/filtering approaches, but I am a bit
worried about our ability to control the processing pipeline and contain
runaway queries, since we will have a huge corpus. I still find it very
hard to get an overview of the processing pipeline, what happens at the
different steps, and how I can control it.

I would appreciate any advice on how to approach our problem using
Elasticsearch.

--

Hello, using facets could help you find the most irrelevant or relevant
items in your different steps.
Regards
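
For example, something like a simple terms facet alongside the query (the
index and field names below are just placeholders) shows how the matching
items are distributed, and you can then filter down to the values you care
about:

curl -XPOST 'localhost:9200/items/_search?pretty' -d '{
  "query": { "match": { "text": "example query" } },
  "facets": {
    "by_category": {
      "terms": { "field": "category", "size": 10 }
    }
  }
}'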

--

Thanks for the input. It actually didn't occur to me right away to use
faceting, although I can see my situation as a similar problem of narrowing
search results. I am not yet familiar with how the facet functionality in
ES works, but maybe you could provide me with some more specifics on how
this would work in my case? I would need to use my own custom scripts or
native code to perform my calculations, which would need to look up
additional data in another index to calculate a score I can filter items
with. As I said, I am all new to Elasticsearch, so I still have to do some
more experiments before I actually know my way around. Maybe I just need to
play around with it a little more before I get this "aha" experience :)

--

Indeed, you should start with some basic tests to explore what ES can do.
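
For example, even something this small (index a single document, then
search for it) is enough to start experimenting:

curl -XPUT 'localhost:9200/test/item/1?refresh=true' -d '{ "text": "hello search" }'
curl -XPOST 'localhost:9200/test/_search?pretty' -d '{ "query": { "match": { "text": "hello" } } }'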

--