ElasticSearch - Suggestions on retrieving relevant documents - Query-I


(pounraj.manikandan) #1

Hello All,

We provide full-text search over structured and unstructured
documents to end users using Elasticsearch. We currently have around
100GB of indexed documents in the Elasticsearch index store, and this will
surely grow to 100TB in the future. We are comfortable with setting up our
environment and the index model for our documents, and with basic querying
over the documents using query and filter clauses. Elasticsearch has been
really good and flexible in bringing us up to this level. Thanks to the whole team.

We are now in the phase of retrieving relevant information for the users
from the Elasticsearch index store. We don't have index-time boost factors, so
we plan to apply boosts at query time.

*Example:*

*UserA* sponsors 3 football teams (FTeamA, FTeamB, FTeamC) and the owner's age
is 36
*UserB* sponsors 26 cricket teams (CTeamA, CTeamB, ... CTeamZ) and the owner's
age is 36
*UserC* sponsors 1 football team (FTeamA) and 10 cricket teams (CTeamA, CTeamB,
... CTeamJ), and the owner's age is 36
*UserD* sponsors 2 cricket teams (CTeamA, CTeamB) and the owner's age is 36

*AudienceA* is interested in FTeamA, FTeamB and CTeamA, CTeamB, CTeamC, CTeamD, and searches for an owner by his details (as search text) whose age is 36

The order of results AudienceA should get is:
1. *UserA* (as 2 football teams)
2. *UserC* (as 1 football team & 4 cricket teams)
3. *UserB* (as 4 cricket teams)
4. *UserD* (as 2 cricket teams)

We are able to get this result, but only by providing a high boost value:
BOOST value for each football team match = (MAX_MATCH cricket teams *
BOOST value of cricket team) + DEFAULT BOOST value of football team
BOOST value for each football team match = (26 * 2.2) + 4.2 => 61.4

*UserA* (as 2 football teams) = (61.4 * 2) + weight (field & query) = 122.8 + weight (field & query)
*UserC* (as 1 football team & 4 cricket teams) = 61.4 + (2.2 * 4) + weight (field & query)
*UserB* (as 4 cricket teams) = (2.2 * 4) + weight (field & query)
*UserD* (as 2 cricket teams) = (2.2 * 2) + weight (field & query)
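
The arithmetic above can be checked with a small sketch. The boost values (2.2, 4.2, 26) come from the post; the data layout and the helper names (`cricket_teams`, `boost_score`) are assumptions made for illustration, and the "weight (field & query)" term is left out because its value is not given.

```python
# Boost values taken from the post; everything else is a hypothetical layout.
CRICKET_BOOST = 2.2
FOOTBALL_DEFAULT_BOOST = 4.2
MAX_CRICKET_MATCHES = 26  # most cricket teams any single owner sponsors

# One football match must outweigh the largest possible cricket contribution.
FOOTBALL_BOOST = MAX_CRICKET_MATCHES * CRICKET_BOOST + FOOTBALL_DEFAULT_BOOST


def cricket_teams(n):
    """Return the first n cricket team names: CTeamA, CTeamB, ..."""
    return {f"CTeam{chr(ord('A') + i)}" for i in range(n)}


sponsors = {
    "UserA": {"FTeamA", "FTeamB", "FTeamC"},
    "UserB": cricket_teams(26),
    "UserC": {"FTeamA"} | cricket_teams(10),
    "UserD": cricket_teams(2),
}
audience_interests = {"FTeamA", "FTeamB"} | cricket_teams(4)


def boost_score(teams, interests):
    """Sum the per-match boosts, excluding the field/query weight."""
    matched = teams & interests
    football = sum(1 for team in matched if team.startswith("FTeam"))
    cricket = len(matched) - football
    return football * FOOTBALL_BOOST + cricket * CRICKET_BOOST


scores = {user: boost_score(teams, audience_interests)
          for user, teams in sponsors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because a single football match is worth more than all 26 possible cricket matches combined, any owner with more football matches always ranks above an owner with fewer, regardless of cricket matches.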

We plan to make the boost factor for a football team higher than that for a
cricket team, and the boost factor for a cricket team higher than that for age.

We use query_string with bool filters, and the default operator for
query_string is *AND*.
We use a script scoring function to iterate over the individual matches and
increase the score for each one.
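
For readers on a more recent Elasticsearch than the one used in this thread, a per-match boost of this kind can also be expressed without a script, using a `function_score` query whose functions each pair a `filter` with a `weight`. This is a different technique from the script scoring described above, shown only as a sketch: the field names (`football_teams`, `cricket_teams`, `age`) and the `build_query` helper are hypothetical, and the weights are the ones from the example.

```python
import json


def build_query(search_text, football_teams, cricket_teams, age,
                football_weight=61.4, cricket_weight=2.2):
    """Build a search body that adds a fixed weight per matched team.

    Field names are hypothetical; adjust them to the real mapping.
    """
    functions = (
        [{"filter": {"term": {"football_teams": team}}, "weight": football_weight}
         for team in football_teams]
        + [{"filter": {"term": {"cricket_teams": team}}, "weight": cricket_weight}
           for team in cricket_teams]
    )
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": {
                            "query_string": {
                                "query": search_text,
                                "default_operator": "AND",
                            }
                        },
                        "filter": {"term": {"age": age}},
                    }
                },
                "functions": functions,
                "score_mode": "sum",   # add up the weights of all matching functions
                "boost_mode": "sum",   # then add that total to the text score
            }
        }
    }


body = build_query(
    "owner details",
    football_teams=["FTeamA", "FTeamB"],
    cricket_teams=["CTeamA", "CTeamB", "CTeamC", "CTeamD"],
    age=36,
)
print(json.dumps(body, indent=2))
```

With `boost_mode` set to `sum`, the final score has the same shape as the formulas above: the summed team boosts plus the text-relevance weight.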

We also tried a sample that customizes the Lucene scoring algorithm, which
Elasticsearch allows through a CustomSimilarity. We will try to use it to
get more relevant documents, along with the field-based scoring we already
have.

It would be good if somebody could guide us on the following:

  1. Is this the correct approach for handling this kind of scenario?
  2. The score created via the Lucene similarity algorithm is heavily suppressed.
    (It still helps us in some situations.) Will there be any problems when
    the number and size of documents increase?
  3. Is it good to have high scores > 500 (if we are not controlling tf, idf,
    and norms)? Will this spoil the whole concept of retrieving relevant
    documents?
  4. We also lose relevant documents when we use the OR operator for query_string
    with more than one query term: UserC is more relevant but is moved to 2nd.
    Suggestions please.

Thanks
Manikandan Pounraj

--


(Chris Male) #2

Hi,

Sounds like a great application you're building.

On Thursday, October 11, 2012 8:50:56 PM UTC+13, Manikandan Pounraj wrote:

  1. Is this the correct approach for handling this kind of scenario?

From what you've said it sounds like a good way to go about it. If I
understand correctly, you're using a script for scoring currently but have
tried using a CustomSimilarity, is that right? Given that your score is
entirely related to your content and not to traditional scoring factors
(such as tf and idf), I'd probably avoid the CustomSimilarity approach and
stick to your script.

  2. The score created via the Lucene similarity algorithm is heavily suppressed.
    (It still helps us in some situations.) Will there be any problems when
    the number and size of documents increase?

There won't be any impact.

  3. Is it good to have high scores > 500 (if we are not controlling tf, idf,
    and norms)? Will this spoil the whole concept of retrieving relevant
    documents?

I find when scores get that high it can be hard to add new properties that
have any influence. Eventually you begin throwing massive values at the
score just to make any difference. An alternative way to structure scoring
is to score down documents that aren't so relevant. This makes the scores
a lot smaller and easier to influence. This approach might not work for
your use case of course.
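
One possible reading of "scoring down" is sketched below: every document starts at 1.0 and is multiplied by a penalty for each wanted team it does not match, so scores stay between 0 and 1. The penalty values and the `penalty_score` helper are assumptions chosen for illustration, not something proposed in the thread; the match counts are taken from the AudienceA example.

```python
# Hypothetical penalties: a missed football team costs far more than a missed
# cricket team. To keep football dominant, FOOTBALL_MISS_PENALTY must be
# smaller than CRICKET_MISS_PENALTY ** (number of cricket teams wanted).
FOOTBALL_MISS_PENALTY = 0.1
CRICKET_MISS_PENALTY = 0.9


def penalty_score(football_matched, cricket_matched,
                  football_wanted, cricket_wanted):
    """Start at 1.0 and multiply in one penalty per wanted team not matched."""
    football_missing = football_wanted - football_matched
    cricket_missing = cricket_wanted - cricket_matched
    return (FOOTBALL_MISS_PENALTY ** football_missing
            * CRICKET_MISS_PENALTY ** cricket_missing)


# (football matches, cricket matches) against AudienceA's 2 football and
# 4 cricket interests, as in the original example.
matches = {"UserA": (2, 0), "UserB": (0, 4), "UserC": (1, 4), "UserD": (0, 2)}

scores = {user: penalty_score(f, c, football_wanted=2, cricket_wanted=4)
          for user, (f, c) in matches.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The ordering is the same as with the large additive boosts, but a new scoring property only needs a factor between 0 and 1 to have a visible effect.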

  4. We also lose relevant documents when we use the OR operator for
    query_string with more than one query term: UserC is more relevant but is
    moved to 2nd. Suggestions please.

Are you able to provide more information about this? Maybe an example of
the query_string and what it matches.


--


(pounraj.manikandan) #3

Hello Chris,

I had not seen your inline comments until now; luckily I read them again when checking the 3 replies (which I had wrongly taken to be for a previous post). So, apologies for the delay in replying.

We ran a test without CustomSimilarity, with the score created via script. It works as expected, but we are planning to test with a large amount of data. That matches what you suggested. Thank you for that; it gives me more confidence in our approach.

Regarding the 4th point:
Let's consider the two documents below and their script scores.
DocumentA: chris has replied for my queries in elasticsearch users nabble.com
Script score: 180
DocumentB: chris has replied for my queries in elasticsearch users nabble.com, but mani didnt had a look properly.
Script score: 200
{
  "query_string": {
    "query": "chris -mani +elasticsearch"
  }
}

According to the script score, DocumentB will be 1st and DocumentA will be 2nd.

But DocumentA is more relevant than DocumentB.

I believe this is correct according to our requirement, as the script scores are not the same.

Thanks again Chris

Pounraj Manikandan

