Creating a tf-idf matrix for unigrams in a corpus using Elasticsearch?

Hi Everyone,
I'm interested in creating a tf-idf matrix for unigrams in a corpus of
documents I have stored in Elasticsearch.

I have searched the list, and from the results I think this may not be
possible out of the box with the Elasticsearch API, but I wanted to
confirm before I start coding up a solution. Is there any way to calculate
tf-idf for every term in a corpus using the API?

I hope the question is clear. I'm a bit new to this space and learning as
I go.

thanks,

--paul

--

There is a plugin that will return the term frequencies:

You can see how the plugin uses standard Lucene code to calculate the term
frequencies at:
https://github.com/jprante/elasticsearch-index-termlist/blob/master/src/main/java/org/elasticsearch/action/termlist/TransportTermlistAction.java#L126

IDF would be trickier. I'm not sure whether anything is exposed to calculate
it. The difficulty in Elasticsearch is that an index is distributed across
shards, so no single Lucene index has all of the terms.
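
To make the distribution problem concrete, here is a rough Python sketch of
what a client-side solution might look like. It is not something
Elasticsearch or the plugin provides: the function names and the shape of
the per-shard statistics are made up for illustration, and it assumes you
have already pulled a document count and per-term document frequencies from
each shard by some means. It uses the plain idf = log(N / df) form, which is
only one of several common variants.

```python
import math
from collections import Counter

def merge_shard_stats(shard_stats):
    """Sum per-shard (doc_count, document_frequencies) pairs into global totals."""
    total_docs = 0
    global_df = Counter()
    for doc_count, df in shard_stats:
        total_docs += doc_count
        global_df.update(df)  # adds the counts term by term
    return total_docs, global_df

def tf_idf(term_freqs, total_docs, global_df):
    """Weight one document's raw term counts by idf = log(N / df)."""
    return {
        term: tf * math.log(total_docs / global_df[term])
        for term, tf in term_freqs.items()
        if global_df[term] > 0  # skip terms with no known document frequency
    }

# Hypothetical statistics for two shards:
# (number of documents, {term: number of documents containing the term})
shard_stats = [
    (2, {"search": 2, "lucene": 1}),
    (2, {"search": 1, "index": 2}),
]

total_docs, global_df = merge_shard_stats(shard_stats)

# Hypothetical raw term counts for a single document.
weights = tf_idf({"search": 3, "index": 1}, total_docs, global_df)
```

The point is only that the document frequencies have to be summed over all
shards before the IDF is computed; an IDF taken from one shard alone would
be based on partial counts.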

Cheers,

Ivan
