Hello All,
We provide full-text search over structured and unstructured documents to
end users using Elasticsearch. We currently have around 100GB of indexed
documents in the Elasticsearch index store, and this will surely grow to
100TB in the future. We are comfortable with setting up our environment and
the index model for our documents, and with basic querying over the
documents using Query & Filter clauses. Elasticsearch has been really good
and flexible in bringing us to this level. Thanks to the whole team.
Now we are at the phase of retrieving relevant information for users from
the Elasticsearch index store. We don't have index-time boost factors, so
we plan to apply boosts at query time.
*Example:*
*UserA* sponsors 3 football teams (FTeamA, FTeamB, FTeamC) and the owner's
age is 36
*UserB* sponsors 26 cricket teams (CTeamA, CTeamB, ... CTeamZ) and the
owner's age is 36
*UserC* sponsors 1 football team (FTeamA), 10 cricket teams (CTeamA, CTeamB,
... CTeamJ) and the owner's age is 36
*UserD* sponsors 2 cricket teams (CTeamA, CTeamB) and the owner's age is 36
*AudienceA* is interested in FTeamA, FTeamB and CTeamA, CTeamB, CTeamC,
CTeamD, and searches for an owner by his details (as search text) whose age
is 36
The order of results AudienceA should get is:
1. *UserA* (as 2 football teams)
2. *UserC* (as 1 football team & 4 cricket teams)
3. *UserB* (as 4 cricket teams)
4. *UserD* (as 2 cricket teams)
We are able to produce this result, but only by providing a high boost
value:
BOOST value for each football team match = (MAX_MATCH cricket team *
BOOST value of cricket team) + DEFAULT BOOST value of football team
BOOST value for each football team match = (26 * 2.2) + 4.2 => 61.4
(for each football team match)
*UserA* (as 2 football teams) = (61.4 * 2) + weight (field & query) = 122.8
+ weight (field & query)
*UserC* (as 1 football team & 4 cricket teams) = 61.4 + (2.2 * 4) +
weight (field & query)
*UserB* (as 4 cricket teams) = (2.2 * 4) + weight (field & query)
*UserD* (as 2 cricket teams) = (2.2 * 2) + weight (field & query)
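To double-check the boost arithmetic above, here is a minimal Python sketch.
It uses only the boost values and team lists from the example; the data
layout and the helper name boost_score are assumptions made for
illustration, and the "weight (field & query)" part is left out.

```python
# Boost values taken from the example; everything else is illustrative.
CRICKET_BOOST = 2.2
FOOTBALL_DEFAULT_BOOST = 4.2
MAX_CRICKET_MATCHES = 26  # largest number of cricket teams any user sponsors

# One football match must outweigh the largest possible cricket contribution.
FOOTBALL_BOOST = MAX_CRICKET_MATCHES * CRICKET_BOOST + FOOTBALL_DEFAULT_BOOST

sponsors = {
    "UserA": {"FTeamA", "FTeamB", "FTeamC"},
    "UserB": {f"CTeam{c}" for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"},
    "UserC": {"FTeamA"} | {f"CTeam{c}" for c in "ABCDEFGHIJ"},
    "UserD": {"CTeamA", "CTeamB"},
}
audience_interests = {"FTeamA", "FTeamB", "CTeamA", "CTeamB", "CTeamC", "CTeamD"}


def boost_score(teams, interests):
    """Sum the per-match boosts for the teams a user shares with the audience."""
    matched = teams & interests
    football = sum(1 for t in matched if t.startswith("FTeam"))
    cricket = sum(1 for t in matched if t.startswith("CTeam"))
    return football * FOOTBALL_BOOST + cricket * CRICKET_BOOST


scores = {user: boost_score(teams, audience_interests)
          for user, teams in sponsors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for user in ranking:
    print(f"{user}: {scores[user]:.1f}")
```

With these numbers the boost part alone already separates the four users,
which is why the Lucene-derived weight has little influence on the order.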
We plan to make the boost factor for a football team higher than that for a
cricket team, and the boost factor for a cricket team higher than that for
age.
We use query_string with bool filters, and the default operator for
query_string is *AND*.
We use a script scoring function to iterate over the matches and increase
the score for each individual match.
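For comparison, the same per-match boosting could perhaps be expressed
without a script, as a function_score query with one filter/weight function
per team of interest. The sketch below only builds the request body as a
Python dict. The field names "teams" and "age" and the helper name
build_boost_query are hypothetical, and the bool "filter" clause and
"weight" functions assume a reasonably recent Elasticsearch version. It is
not our actual query, just an illustration of the idea.

```python
import json


def build_boost_query(search_text, age, interests,
                      football_boost=61.4, cricket_boost=2.2):
    """Build a function_score body that adds a fixed weight per matching team."""
    functions = []
    for team in sorted(interests):
        # Hypothetical convention: football team names start with "FTeam".
        weight = football_boost if team.startswith("FTeam") else cricket_boost
        functions.append({"filter": {"term": {"teams": team}}, "weight": weight})
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": {
                            "query_string": {
                                "query": search_text,
                                "default_operator": "AND",
                            }
                        },
                        "filter": {"term": {"age": age}},
                    }
                },
                "functions": functions,
                "score_mode": "sum",   # add up the weights of all matching teams
                "boost_mode": "sum",   # then add that total to the query score
            }
        }
    }


body = build_boost_query(
    "owner details", 36,
    {"FTeamA", "FTeamB", "CTeamA", "CTeamB", "CTeamC", "CTeamD"},
)
print(json.dumps(body, indent=2))
```

With score_mode and boost_mode both set to "sum", the final score is the
query score plus the summed weights, which mirrors the "boost + weight
(field & query)" arithmetic in the example.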
We also tried a sample that customizes the Lucene scoring algorithm, using
the way Elasticsearch allows us to customize it via CustomSimilarity. We
will try to use it to get more relevant documents, along with the
field-based scoring we already have.
It would be good if somebody could guide us on the following:
- Is this the correct approach for handling this kind of scenario?
- The score produced by the Lucene similarity algorithm is almost totally
suppressed. (It still helps us in some situations.) Will there be any
problems when the number and size of documents increase?
- Is it good to have high scores > 500 (if we are not controlling tf, idf,
& norms)? Will this spoil the whole concept of retrieving relevant
documents?
- Also, we lose relevant documents when we use the OR operator for
query_string and there is more than one query term; in that case UserC has
more relevance but is moved to 2nd.
Suggestions please.
Thanks
Manikandan Pounraj