Easiest way to get the IDF of a term from an ES cluster


(Matt Luongo) #1

Hi folks,

I'm trying to retrieve the IDF of a few particular terms from an
elasticsearch cluster. Though I have plenty of Java experience, I'm only
somewhat familiar with Lucene and know nothing about ES internals. What
would be the path of least resistance here?

Thanks in advance!

- Matt

--


(phill) #2

Since IDF is neither a document nor a field of a document, the only
place to generate a dynamic number is in a facet.
I would look into what the terms, terms_stats, or statistical facet can
do, and then register a native script that roots around in the Java API to
find the right value and returns it instead of the facet's usual values.

I think I saw somewhere that you can get hold of the right Lucene objects
to ask for the IDF.
If you can get the Lucene IndexSearcher, you can call docFreq(term).
Will that help? After all, the claimed formula is

idf = log(numDocs/(docFreq+1)) + 1
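
For illustration, here is a minimal Java sketch of that formula. It assumes
the natural logarithm; the class name, method name, and sample counts are
hypothetical, and in a real cluster the two counts would have to come from
the index (for example via docFreq(term)) rather than being passed in by hand.

```java
public class IdfSketch {
    /**
     * idf = log(numDocs / (docFreq + 1)) + 1, using the natural logarithm.
     * The +1 in the denominator avoids division by zero for unseen terms.
     * Assumption: both counts are supplied by the caller, not read from an index.
     */
    public static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 1000 documents, 9 of which contain the term.
        System.out.println(idf(9, 1000)); // log(100) + 1, about 5.605
    }
}
```

A rarer term (smaller docFreq) gets a larger value, which is the point of
the measure.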

Since any of these facets runs over every document matching the query, I
guess you'd have to run the facet against a single document (any one will
do) rather than all of them, since IDF is a property of a term across the
entire index.

Keep in mind that you can send params into a script, so maybe:

"statistical" : {
    "script" : "someCrazyBitOfJava",
    "params" : {
        "idfForTerm" : "foo"
    }
}

I think that is 3 round pegs in 2.5 square holes, but I think it might
work. :)

-Paul

--


(Matt Luongo) #3

Thanks Paul! docFreq() would definitely help. I'll give this one a shot
soon and report back.

--
Matt Luongo
Software Developer
about.me/luongo

In reply to P.Hill's message of Wed, Sep 5, 2012 at 12:50 AM.

--


(phill) #4

After I wrote that, I recalled someone mentioning a plug-in that could
help with this kind of index-level information.

-Paul

In reply to Matt Luongo's message of 9/6/2012 1:19 PM.

--


(Matt Luongo) #5

Maybe https://github.com/jprante/elasticsearch-skywalker? It looks like it
can return high-frequency terms, but not the docFreq for a particular term.

In reply to P. Hill's message of Thu, Sep 6, 2012 at 8:18 PM.

--

