Hey all,

A question on possible search paths/structures. We have a text document, we have run our magic over it, and we have come away with Topics and Entities (like Barack Obama and Apple Inc.), each with a relevancy score. What would be the best way to store these and query against them?

We are currently trying a parent/child relationship, where the children are the terms with their relevancy scores, and the parent text document is scored from the relevancy scores of its children. That works; we are just worried about the speed of parent/child queries against millions of documents.

Another approach we thought of was to build our own scorer/analyzer. If we index tokens like BarackObama.93345|AppleInc.0034, which encode both the topic and its relevancy score for the document, I can build an analyzer to read those sorts of tokens, but is there any way to build a scorer that can use that token-match data to score?

And third, is there any other way to normalize this data into one document so we can score on it? That seems like it would be the fastest way to query, but my #2 option here is the only way I can think of doing it.

Is anyone else tagging their documents with per-topic relevancy scores, then letting people search for those topics and pulling back the relevant docs based on the per-document relevancy scores?
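For reference, roughly what our current parent/child layout looks like — index, type, and field names are simplified, and I'm writing the 1.x syntax from memory, so treat it as a sketch rather than a working example:

```json
PUT /corpus
{
  "mappings": {
    "article": {
      "properties": { "body": { "type": "string" } }
    },
    "topic": {
      "_parent": { "type": "article" },
      "properties": {
        "name":      { "type": "string", "index": "not_analyzed" },
        "relevancy": { "type": "float" }
      }
    }
  }
}

GET /corpus/article/_search
{
  "query": {
    "has_child": {
      "type": "topic",
      "score_mode": "max",
      "query": {
        "function_score": {
          "query": { "term": { "name": "Barack Obama" } },
          "field_value_factor": { "field": "relevancy" },
          "boost_mode": "replace"
        }
      }
    }
  }
}
```

So the child's relevancy value becomes the child score, and has_child folds that up into the parent's score.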
Interesting.

So: set a payload on the term (in this case the topic/entity), where the payload is the relevancy value. Then we can do our function_score on the query of the main documents themselves, with no need for parent/child.

Have you done this? Any concerns about performance with this sort of scoring, or is it just as fast as base Lucene scoring if we override the score function and use our own?

We will of course try it and run our own performance tests; just looking to see if you already have any insights.
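Something like this is what I have in mind — a rough, untested sketch using the stock delimited_payload_filter (field and analyzer names invented) rather than our custom BarackObama.93345 token format:

```json
PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "topic_payloads": {
          "tokenizer": "whitespace",
          "filter": ["delimited_payload_filter"]
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "body":   { "type": "string" },
        "topics": { "type": "string", "analyzer": "topic_payloads" }
      }
    }
  }
}

PUT /articles/article/1
{ "topics": "barack_obama|0.93345 apple_inc|0.0034" }
```

If I read the docs right, the filter's default delimiter is | and the default encoding is float, so each topic token would carry its relevancy score as a payload.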
Super helpful!
Scott
On Saturday, August 23, 2014 7:50:18 AM UTC-7, Clinton Gormley wrote:
Have you done this? Any concerns about performance with this sort of scoring, or is it just as fast as base Lucene scoring if we override the score function and use our own?
We will of course try it and run our own performance tests; just looking to see if you already have any insights.
I haven't benchmarked it myself. Obviously accessing payloads is slower
than not, and some further work could be done on the scripting side to
cache some term statistics lookups, but I don't know how performance will
compare to doing this natively.
I'm curious: using the delimited_payload_filter, how do you know which term in the delimited list was hit by the query in the script? From the "text scoring in scripts" documentation, it seems you have to know the term up front:
_index['FIELD'].get('TERM', _PAYLOADS)
Is the matched term accessible in the script in some way?
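To make the question concrete, here is the sort of query I mean (untested sketch; Groovy script syntax as I understand it from the 1.x docs, with invented field and term names) — the script only works because the term is passed in as a param:

```json
GET /articles/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "topics": "barack_obama" } },
      "boost_mode": "replace",
      "script_score": {
        "lang": "groovy",
        "params": { "term": "barack_obama" },
        "script": "sum = 0.0; for (pos in _index['topics'].get(term, _PAYLOADS)) { sum += pos.payloadAsFloat(0) }; sum"
      }
    }
  }
}
```

If the user searches for several topics at once, I don't see how the script learns which of them actually matched without being told.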
On Monday, August 25, 2014 6:49:01 AM UTC-4, Clinton Gormley wrote: