Question regarding sharding when fetching term vector and TFIDF info

Clint_Miller · August 14, 2013, 2:23pm

I've written a native script plugin for Elasticsearch that extends
AbstractSearchScript. My code grabs a handle to the
org.apache.lucene.index.IndexReader and the docId. It then uses the
IndexReader to fetch the term vector for the docId, the frequency of each
term within the document, and the number of documents containing that term.
All of that data is then stored in a JSON string, which is sent back to the
code performing the query.

We're currently on Elasticsearch 0.20.5 (slightly old). I'm using the
following IndexReader call to get the document count:

reader.docFreq()

I'm pretty sure my code will need to change when we upgrade to the latest
Elasticsearch, as the Lucene IndexReader methods appear to have changed. No
big deal.

My question: Is reader.docFreq() shard-aware? I'm assuming not. I'm
assuming I'll only get the document count within the current shard.

If my assumption is correct, is there a way to get the document count for a
term across all shards? Is there a way for my code to access the
IndexReaders for the other shards? If there were, then I could just call
reader.docFreq() for each shard and add the results.

(I'd happily upgrade to the latest Elasticsearch if it makes it easier to
solve this problem.)

Thanks for any help.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Clint_Miller · August 14, 2013, 7:44pm

I might have figured this out. Instead of getting the IndexReader by
overriding the AbstractSearchScript method setNextReader(), I'm making the
following call:

val searchContext = org.elasticsearch.search.internal.SearchContext.current
searchContext.searcher.getIndexReader.docFreq(currTerm)

That seems to return a higher-level reader that rolls up the doc counts
from a set of lower-level readers that might represent the shards?

I can also call searchContext.searcher.subReaders to get an array of child
readers and call docFreq() on each. If I add all these up, I get the same
value as searchContext.searcher.getIndexReader.docFreq().

Am I on the right track?
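
The observation above (per-sub-reader counts adding up to the top-level count) can be illustrated with a toy model. This is only a sketch: `Segment` and `ShardReader` are hypothetical stand-ins written for this example, not Lucene or Elasticsearch classes, and the counts are made up. It assumes the sub-readers are the segments of a single shard's Lucene index, so the rolled-up number is still a shard-local count.

```java
import java.util.List;
import java.util.Map;

// Toy model only: Segment and ShardReader are hypothetical stand-ins,
// not Lucene classes. All counts are invented for illustration.
class DocFreqRollup {

    /** One segment-level sub-reader: term -> number of documents containing it. */
    static final class Segment {
        private final Map<String, Integer> docFreqByTerm;

        Segment(Map<String, Integer> docFreqByTerm) {
            this.docFreqByTerm = docFreqByTerm;
        }

        int docFreq(String term) {
            return docFreqByTerm.getOrDefault(term, 0);
        }
    }

    /** Top-level reader of ONE shard: its docFreq is the sum over its segments. */
    static final class ShardReader {
        private final List<Segment> segments;

        ShardReader(List<Segment> segments) {
            this.segments = segments;
        }

        int docFreq(String term) {
            int total = 0;
            for (Segment segment : segments) {
                total += segment.docFreq(term);
            }
            return total;
        }
    }

    /** Builds a sample shard made of two segments. */
    static ShardReader sampleShard() {
        return new ShardReader(List.of(
                new Segment(Map.of("lucene", 3, "shard", 1)),
                new Segment(Map.of("lucene", 2))));
    }

    public static void main(String[] args) {
        ShardReader shard = sampleShard();
        System.out.println("lucene: " + shard.docFreq("lucene")); // 3 + 2 = 5
        System.out.println("shard: " + shard.docFreq("shard"));   // 1 + 0 = 1
    }
}
```

In this model, summing the segments by hand and asking the shard-level reader give the same number, which matches the behaviour described above, but neither says anything about documents held by other shards.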

jprante · August 14, 2013, 8:22pm

IndexReader (or better AtomicReaderContext) is not exposed in the API for
use from a (native) script. In 0.90+ the situation has not changed.

You are correct: you wouldn't be able to get the doc freq of the whole index.
You would get the doc freq for the current shard only.

To compute the doc freq across all shards, you would have to write a plugin
with a custom action that distributes an index-level request as shard-level
requests to the nodes that hold the shards, and finally combines the
subresults. An example plugin is jprante/elasticsearch-index-termlist on
GitHub.

Jörg
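
The distribute-then-combine pattern described above can be sketched in plain Java. This is not the code of the example plugin and uses no Elasticsearch API: `Shard`, `localDocFreq`, and `globalDocFreq` are hypothetical names, the shards are in-memory maps with made-up counts, and a thread pool stands in for sending requests to the nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy scatter/gather model: Shard and globalDocFreq are hypothetical names,
// not part of the Elasticsearch or Lucene APIs.
class ShardScatterGather {

    /** Stand-in for one shard's local index statistics. */
    static final class Shard {
        private final Map<String, Integer> docFreqByTerm;

        Shard(Map<String, Integer> docFreqByTerm) {
            this.docFreqByTerm = docFreqByTerm;
        }

        /** Shard-local count, analogous to what a script sees on its own shard. */
        int localDocFreq(String term) {
            return docFreqByTerm.getOrDefault(term, 0);
        }
    }

    /** Scatters one request per shard, then gathers and sums the partial counts. */
    static long globalDocFreq(List<Shard> shards, String term) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<Integer>> requests = new ArrayList<>();
            for (Shard shard : shards) {
                requests.add(() -> shard.localDocFreq(term));
            }
            long total = 0;
            for (Future<Integer> partial : pool.invokeAll(requests)) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for shards", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a shard request failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Builds three sample shards with invented counts. */
    static List<Shard> sampleShards() {
        return List.of(
                new Shard(Map.of("lucene", 5, "shard", 1)),
                new Shard(Map.of("lucene", 2)),
                new Shard(Map.of("shard", 4)));
    }

    public static void main(String[] args) {
        List<Shard> shards = sampleShards();
        System.out.println("lucene: " + globalDocFreq(shards, "lucene")); // 5 + 2 + 0 = 7
        System.out.println("shard: " + globalDocFreq(shards, "shard"));   // 1 + 0 + 4 = 5
    }
}
```

A real plugin would replace the thread pool with shard-level transport requests and would have to handle unavailable shards; the sketch only shows why the combined number differs from any single shard's answer.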

Clint_Miller · August 16, 2013, 1:23am

Thank you very much. I was able to write a plugin based on your sample
code, and it's working perfectly for me.

On Wednesday, August 14, 2013 3:22:14 PM UTC-5, Jörg Prante wrote:

An example plugin is here
GitHub - jprante/elasticsearch-index-termlist: Elasticsearch Index Termlist
