I've been looking for a way to extract n-gram frequencies from
ElasticSearch as though it were a large table of n-grams by documents. I
found this thread from about a year ago:
"3. The above, 1 and 2, talk about having map reduce implemented on the
"search" aspect. One thing that I would love to also tackle is the "terms"
aspect of a search engine. Being able to run (streaming) map reduce jobs on
terms, especially ones with term vector information, can provide a strong
infrastructure for implementing algos like clustering and the like.
So, yes, it has crossed my mind :), and it is on the roadmap."
I'm wondering what the status of this is today. Is something similar
supported in a different way? I could begin work on a plugin or I could
help with a module in development.
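For illustration, the "large table of n-grams by documents" could look like the minimal Python sketch below. It does not touch Elasticsearch at all; the naive whitespace tokenizer, the helper names `ngrams` and `ngram_table`, and the sample documents are assumptions made up for this example.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of `text` using a naive lowercase whitespace tokenizer."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_table(docs, n=2):
    """Map each n-gram to a {doc_id: count} row: a sparse n-gram-by-document table."""
    table = {}
    for doc_id, text in docs.items():
        for gram, count in Counter(ngrams(text, n)).items():
            table.setdefault(gram, {})[doc_id] = count
    return table


# Hypothetical sample documents, keyed by document id.
docs = {
    "doc1": "the quick brown fox",
    "doc2": "the quick red fox and the quick dog",
}
table = ngram_table(docs, n=2)
```

Each key of `table` is one row (an n-gram) and each inner dictionary holds the non-zero cells for that row, so documents that never contain the n-gram take no space.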
Nothing has happened on that front, though I still toy with the idea.
On Thursday, March 8, 2012 at 10:07 PM, Jimmy Krehl wrote:
I'm wondering what the status of this is today. Is something similar supported in a different way? I could begin work on a plugin or I could help with a module in development.
Is a search plugin the route to go? I'm pretty new to ES and I'm not sure
if there's a framework for those. I'm hoping to be able to leverage the
search infrastructure in ES to distribute the collation of n-grams.
Googling has led me to believe that people link ES's indices to HDFS
and use Mahout to extract TF/IDF data. I'd prefer using ES entirely,
however.
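As a rough sketch of what "distributing the collation of n-grams" could mean, the Python below counts n-grams per shard (a "map" step) and then merges the partial counts (a "reduce" step). It is plain Python, not an Elasticsearch plugin or API; the function names `map_shard` and `reduce_counts`, the list-of-lists stand-in for shards, and the whitespace tokenizer are all assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def map_shard(shard_docs, n=2):
    """'Map' step: count word n-grams over the documents held by one shard."""
    counts = Counter()
    for text in shard_docs:
        tokens = text.lower().split()
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def reduce_counts(partials):
    """'Reduce' step: merge per-shard partial counts into global totals."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical stand-in for an index split across two shards.
shards = [
    ["the quick brown fox", "the quick dog"],
    ["a quick brown cat"],
]
totals = reduce_counts(map_shard(docs) for docs in shards)
```

Because the partial counts merge by simple addition, each shard can be processed independently and only the compact per-shard counters need to travel to whichever node does the merge.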
Thanks!
jimmyk
On Friday, March 9, 2012 10:58:53 AM UTC-8, kimchy wrote:
Nothing has happened on that front, though I still toy with the idea
It should be possible with a plugin, and it might not be that difficult if you have a very specific use case.
On Friday, March 9, 2012 at 10:18 PM, Jimmy Krehl wrote:
Is a search plugin the route to go? I'm pretty new to ES and I'm not sure if there's a framework for those. I'm hoping to be able to leverage the search infrastructure in ES to distribute the collation of n-grams. Googling has led me to believe that people link ES's indices to HDFS and use Mahout to extract TF/IDF data. I'd prefer using ES entirely, however.