Topics/Entities with relevancy scores and searching


(Scott Decker) #1

Hey all,
A question on possible search paths/structure. Suppose we have a text
document, have run our magic over it, and come away with Topics and
Entities (like Barack Obama and Apple Inc.), each with a relevancy score.
What would be the best way to store and query against them?

We are currently trying a parent/child relationship, where the children are
the terms with their relevancy scores, and the parent text document is
scored from the relevancy scores of its children. That works; we are just
worried about the speed of parent/child against millions of documents.

Another way we thought of is to build our own scorer/analyzer. If we
are reading in tokens like BarackObama.93345|AppleInc.0034,
where each token carries the topic and its relevancy score for the document,
I can build an analyzer to read those sorts of tokens. But is there any way
to build a scorer that can use that token match data to score?
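
As a rough illustration of the token format in this second option, here is a
minimal Python sketch. It assumes each token is a topic name followed by its
relevancy as a decimal fraction (so BarackObama.93345 means topic BarackObama
with score 0.93345) and that tokens are separated by "|". The function name
parse_topic_tokens is made up for this example.

```python
def parse_topic_tokens(field_value):
    """Split 'Topic.score|Topic.score' into a {topic: relevancy} dict.

    Assumption: the last '.' in each token starts the fractional score.
    """
    scores = {}
    for token in field_value.split("|"):
        topic, dot, fraction = token.rpartition(".")
        if not dot or not topic or not fraction.isdigit():
            raise ValueError(f"unexpected token: {token!r}")
        scores[topic] = float("0." + fraction)
    return scores


topics = parse_topic_tokens("BarackObama.93345|AppleInc.0034")
```

A custom analyzer would do the same split at index time; the open question in
the post is how to get the parsed score into the scorer.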

And third, is there any other way to normalize this data into one document
so we can score on it? That seems like it would be the fastest way to
query, but option #2 above is the only way I can think of doing it.
Is anyone else tagging their documents with per-topic relevancy scores,
then letting people search for those topics and pulling back
the relevant docs based on the per-document relevancy scores?

Thanks,
Scott

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9434db79-363f-4470-bf91-b960908c2de6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Clinton Gormley) #2

Have a look at:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-delimited-payload-tokenfilter.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html
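
For readers who have not used the first link: below is a minimal sketch of
index settings that attach a float payload to each topic token, written as a
Python dict that could be sent as the index-creation body. It assumes the
topics are indexed as whitespace-separated topic|score text (for example
"BarackObama|0.93345 AppleInc|0.0034"), which differs from the "|"-separated
format in the original question. The names topic_score and topic_payloads are
made up, and the filter type name should be checked against the Elasticsearch
version in use.

```python
import json

# Analysis settings for a field whose tokens carry a relevancy payload.
# "topic_score" and "topic_payloads" are hypothetical names for this sketch.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "topic_score": {
                    "type": "delimited_payload_filter",
                    "delimiter": "|",     # separates the term from its payload
                    "encoding": "float",  # store the payload as a float
                }
            },
            "analyzer": {
                "topic_payloads": {
                    "type": "custom",
                    "tokenizer": "whitespace",  # one token per topic|score pair
                    "filter": ["topic_score"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```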

In reply to Scott Decker's message of 23 August 2014, 15:04.


(Scott Decker) #3

Interesting.
So, set a payload on the term (in this case the topic/entity), with the
relevancy value as the payload. Then you can do your function score in the
query on the main documents themselves, with no need for parent/child.
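
To make the idea concrete, here is a toy Python model (not Elasticsearch
code): a tiny inverted index whose postings carry the relevancy as a payload,
and a search that scores each document from the payloads of the topics asked
for. All names and the sample documents are made up, and summing the payloads
is just one possible scoring choice.

```python
from collections import defaultdict


def build_index(docs):
    """Map topic -> {doc_id: relevancy payload} from {doc_id: {topic: score}}."""
    index = defaultdict(dict)
    for doc_id, topics in docs.items():
        for topic, relevancy in topics.items():
            index[topic][doc_id] = relevancy
    return index


def search(index, query_topics):
    """Score each matching doc by the sum of payloads of the queried topics."""
    scores = defaultdict(float)
    for topic in query_topics:
        for doc_id, relevancy in index.get(topic, {}).items():
            scores[doc_id] += relevancy
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data: topics and scores live on the document itself.
docs = {
    "doc1": {"BarackObama": 0.93345, "AppleInc": 0.0034},
    "doc2": {"AppleInc": 0.8},
}
index = build_index(docs)
results = search(index, ["AppleInc"])
```

Because the score travels with the posting, no join between parent and child
documents is needed at query time.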

Have you done this? Any performance concerns with this sort of scoring?
Or is it just as fast as base Lucene scoring if we override
the score function and just use our own?
We will of course try it and run our own performance tests; just looking
to see if you already have any insights.

Super helpful!
Scott

In reply to Clinton Gormley's message of Saturday, August 23, 2014 7:50:18 AM UTC-7.


(Clinton Gormley) #4

On 24 August 2014 19:46, Scott Decker scott@publishthis.com wrote:

Have you done this? Any performance concerns with this sort of scoring?
Or is it just as fast as base Lucene scoring if we override
the score function and just use our own?
We will of course try it and run our own performance tests; just
looking to see if you already have any insights.

I haven't benchmarked it myself. Obviously, accessing payloads is slower
than not accessing them, and some further work could be done on the
scripting side to cache some term statistics lookups, but I don't know how
performance will compare to doing this natively.

Would be interested in your feedback.

clint



(hespoddi) #5

I'm curious: when using the delimited_payload_filter, how do you know in the
script which term in the delimited list was hit by the query? From the "text
scoring in scripts" documentation, it seems you have to know the term:

_index['FIELD'].get('TERM', _PAYLOADS)

Is the matched term accessible in the script in some way?

In reply to Clinton Gormley's message of Monday, August 25, 2014 6:49:01 AM UTC-4.

