Terms / Documents Matrix

I've been looking for a way to extract n-gram frequencies from
Elasticsearch as though it were a large table of n-grams by documents. I
found this thread from about a year ago:

http://elasticsearch-users.115913.n3.nabble.com/Pseudo-map-reduce-for-searchresults-td2683300.html

"3. The above, 1 and 2, talk about having map reduce implemented on the
"search" aspect. One thing that I would love to also tackle is the "terms"
aspect of a search engine. Being able to run (streaming) map reduce jobs on
terms, especially ones with term vector information, can provide a strong
infrastructure for implementing algos like clustering and the like.

So, yes, it has crossed my mind :), and it is on the roadmap."

I'm wondering what the status of this is today. Is something similar
supported in a different way? I could begin work on a plugin or I could
help with a module in development.

Thanks,
Jim Krehl
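
For readers unsure what "a large table of n-grams by documents" means here, the following is a minimal sketch in plain Python. It does not use any Elasticsearch API; the function names, the sample documents, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_document_matrix(docs, n=2):
    """Build a sparse n-gram x document table: {ngram: {doc_id: count}}."""
    matrix = {}
    for doc_id, text in docs.items():
        # Assumption: naive lowercase + whitespace tokenization, not a real analyzer.
        tokens = text.lower().split()
        for gram, count in Counter(ngrams(tokens, n)).items():
            matrix.setdefault(gram, {})[doc_id] = count
    return matrix

# Hypothetical sample documents, keyed by a made-up document id.
docs = {
    "d1": "the quick brown fox",
    "d2": "the quick red fox the quick dog",
}
matrix = ngram_document_matrix(docs, n=2)
```

Each key of `matrix` is one row of the table (an n-gram) and each inner dict holds the non-zero cells of that row, so `matrix[("the", "quick")]` gives the per-document counts for that bigram.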

Nothing has happened on that front, though I still toy with the idea :)

On Thursday, March 8, 2012 at 10:07 PM, Jimmy Krehl wrote:

I've been looking for a way to extract n-gram frequencies from Elasticsearch as though it were a large table of n-grams by documents. I found this thread from about a year ago:

http://elasticsearch-users.115913.n3.nabble.com/Pseudo-map-reduce-for-searchresults-td2683300.html

"3. The above, 1 and 2, talk about having map reduce implemented on the "search" aspect. One thing that I would love to also tackle is the "terms" aspect of a search engine. Being able to run (streaming) map reduce jobs on terms, especially ones with term vector information, can provide a strong infrastructure for implementing algos like clustering and the like.

So, yes, it has crossed my mind :), and it is on the roadmap."

I'm wondering what the status of this is today. Is something similar supported in a different way? I could begin work on a plugin or I could help with a module in development.

Thanks,
Jim Krehl

Is a search plugin the route to go? I'm pretty new to ES and I'm not sure
if there's a framework for those. I'm hoping to be able to leverage the
search infrastructure in ES to distribute the collation of n-grams.
Googling has led me to believe that people link ES's indices to HDFS
and use Mahout to extract TF/IDF data. I'd prefer using ES entirely,
however.

Thanks!
jimmyk
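
As a rough illustration of what "distributing the collation of n-grams" could look like, here is a sketch in plain Python of a per-shard count followed by a merge, ending in an IDF value per n-gram. It is not an Elasticsearch plugin and uses no Elasticsearch API; `map_shard`, `reduce_counts`, `idf`, the shard layout, and the sample texts are all hypothetical.

```python
import math
from collections import Counter
from functools import reduce

def map_shard(shard_docs, n=2):
    """'Map' step: for one shard, count how many documents contain each n-gram."""
    doc_freq = Counter()
    for text in shard_docs:
        # Assumption: naive lowercase + whitespace tokenization.
        tokens = text.lower().split()
        # A set, so each n-gram is counted at most once per document.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(grams)
    return doc_freq

def reduce_counts(partials):
    """'Reduce' step: merge the per-shard counters into one."""
    return reduce(lambda a, b: a + b, partials, Counter())

def idf(doc_freq, total_docs):
    """Plain inverse document frequency: log(N / df)."""
    return {gram: math.log(total_docs / df) for gram, df in doc_freq.items()}

# Hypothetical data: two "shards", each holding a list of document texts.
shards = [
    ["the quick brown fox", "the quick dog"],
    ["a quick brown cat"],
]
doc_freq = reduce_counts(map_shard(s) for s in shards)
idf_scores = idf(doc_freq, total_docs=sum(len(s) for s in shards))
```

The map step only needs the documents local to a shard, which is why the work could in principle run where the data lives; only the small counters have to travel to the node doing the merge.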

On Friday, March 9, 2012 10:58:53 AM UTC-8, kimchy wrote:

Nothing has happened on that front, though I still toy with the idea :)

It should be possible with a plugin, and it might not be that difficult if you have a very specific use case.

On Friday, March 9, 2012 at 10:18 PM, Jimmy Krehl wrote:

Is a search plugin the route to go? I'm pretty new to ES and I'm not sure if there's a framework for those. I'm hoping to be able to leverage the search infrastructure in ES to distribute the collation of n-grams. Googling has led me to believe that people link ES's indices to HDFS and use Mahout to extract TF/IDF data. I'd prefer using ES entirely, however.

Thanks!
jimmyk
