I'm trying to retrieve the IDF of a few particular terms from an
Elasticsearch cluster. Though I have plenty of Java experience, I'm only
somewhat familiar with Lucene and know nothing about ES internals. What
would be the path of least resistance here?
Since IDF is neither a document nor a field of a document, the only
place to generate a dynamic number is in a facet.
I would look into what the terms facet, the terms_stats facet, or the
statistical facet can do, and then register a native script that digs
through the Java API to find the right value and returns it in place of
the usual facet values.
I think I saw somewhere that you can get hold of the Lucene objects needed
to ask for the IDF.
If you can get the Lucene IndexSearcher, you can call docFreq(term).
Will that help? After all, the claim is that
idf = log(numDocs/(docFreq+1)) + 1
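Once you have the two numbers, the formula itself is trivial to compute. Here is a minimal sketch of just that step, not of the ES plumbing. The class name IdfSketch is made up for illustration, and I'm assuming a natural logarithm and that docFreq and numDocs have already been read from Lucene (for example via IndexReader.docFreq(term) and IndexReader.numDocs()).

```java
// Sketch only: computes idf = log(numDocs / (docFreq + 1)) + 1 from two
// index statistics. IdfSketch is a hypothetical helper, not part of
// Lucene or Elasticsearch.
final class IdfSketch {

    /**
     * @param docFreq number of documents containing the term
     * @param numDocs total number of documents in the index
     * @return the IDF value according to the formula quoted above
     */
    static double idf(long docFreq, long numDocs) {
        if (docFreq < 0 || numDocs < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        // Assumption: natural log. The +1 in the denominator avoids
        // division by zero for terms that appear in no document.
        return Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Example values, chosen arbitrarily: a term found in 9 of 100 docs.
        System.out.println(idf(9, 100));
    }
}
```

A native script would only need to look up the two counts and return the result of a call like idf(docFreq, numDocs).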
Since each of these facets runs over every document matched by the query,
I guess you'd want to restrict the query to a single document (any
document) rather than all of them, because the IDF of a term is a property
of the entire index, not of an individual document.
Keep in mind that you can send params into a script, so maybe: