LSA, pLSA, LDA and elasticsearch

Hi All,

I'm new to Elasticsearch and to some indexing techniques, but I'm very
interested in LSA, pLSA and LDA.
Basically I have no idea how to start implementing one of them in
Elasticsearch or what I should look at (source code, documentation,
plugins, whatever).
Can someone point me to a good approach for implementing those algorithms?
Thank you in advance.
M@rco

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi, Marco --

You could use the information in the underlying Lucene indexes from
Elasticsearch to feed LSI or LDA, but implementing those techniques is
unrelated to Elasticsearch's core use case. For more background, you could
try Gensim (http://radimrehurek.com/gensim/), Vowpal Wabbit
(https://github.com/JohnLangford/vowpal_wabbit/wiki/Latent-Dirichlet-Allocation),
or Mallet (http://mallet.cs.umass.edu/).


prb@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


Hi Paul,

Thank you for the answer and the links.
Could you please explain why it is unrelated to Elasticsearch?

M@rco


Hi, Marco --

Maybe I shouldn't be so quick to be discouraging, but you're headed off
into the deepest corner of the deep end of the pool for your first swim
with Elasticsearch. That said...

Elasticsearch is a (very nice) wrapper for Lucene. Lucene itself is an
information retrieval system that in its default configuration combines
boolean searches (for matching) with the vector space model (for scoring).
The javadocs for the Similarity and TFIDFSimilarity classes are a good place
to start reading, and you can compare/contrast the Elasticsearch docs on
similarity (http://www.elasticsearch.org/guide/reference/index-modules/similarity/),
which is how the Lucene similarity implementations are manifest within
Elasticsearch.
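
To make "vector space model (for scoring)" concrete, here is a minimal
sketch in plain Python. It uses a textbook tf-idf weighting with cosine
similarity and is only an illustration: Lucene's TFIDFSimilarity applies
additional factors (norms, boosts and so on) that are not reproduced here,
and the function names and toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one {term: weight} vector per tokenised document, plus the idf table."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Textbook smoothed idf; this is NOT Lucene's exact formula.
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def score(query, docs):
    """Score every document against the query terms."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the corpus get weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    return [cosine(q, v) for v in vectors]


# Toy corpus (hypothetical): three already-tokenised documents.
docs = [
    ["topic", "model", "lda"],
    ["search", "engine", "index"],
    ["lda", "search"],
]
scores = score(["lda"], docs)
print(scores)
```

A document that does not contain the query term scores zero, which is the
point: the scoring only ever sees terms.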

LDA (to pick one of your items) is a hierarchical Bayesian model for a set
of documents based on latent distributions ("topics") over a vocabulary.

Lucene operates on terms; LDA operates on topics. And terms != topics.
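
A small sketch of why terms != topics, in plain Python. The two "topics"
below are hand-written, hypothetical word distributions, not learned from
data, and the estimation step is a simplified EM "folding-in" with the
topics held fixed. Full LDA inference also learns the topics and places
Dirichlet priors on them, which this toy does not do.

```python
# Hypothetical, hand-written topic-word probabilities (not learned from data).
TOPICS = {
    "pets": {"cat": 0.4, "dog": 0.4, "vet": 0.2},
    "finance": {"bank": 0.5, "loan": 0.3, "rate": 0.2},
}


def topic_mixture(tokens, topics=TOPICS, iterations=20, smoothing=1e-6):
    """Estimate topic proportions for one document, keeping the topics fixed."""
    names = list(topics)
    theta = {t: 1.0 / len(names) for t in names}  # start from a uniform mixture
    if not tokens:
        return theta
    for _ in range(iterations):
        counts = {t: 0.0 for t in names}
        for word in tokens:
            # E-step: how responsible is each topic for this word?
            # `smoothing` is a tiny probability for words a topic does not list.
            weights = {t: theta[t] * topics[t].get(word, smoothing) for t in names}
            total = sum(weights.values())
            for t in names:
                counts[t] += weights[t] / total
        # M-step: new proportions are the averaged responsibilities.
        theta = {t: counts[t] / len(tokens) for t in names}
    return theta


doc_a = ["cat", "vet"]
doc_b = ["dog", "dog"]

shared_terms = set(doc_a) & set(doc_b)  # no terms in common
mix_a = topic_mixture(doc_a)
mix_b = topic_mixture(doc_b)
print(shared_terms, mix_a, mix_b)
```

The two documents share no terms, so a term-based match between them finds
nothing, yet both are assigned almost entirely to the same topic.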

You might find this GitHub project informative:
https://github.com/stepthom/lucene-lda

Best.
-- Paul


prb@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


Note that LSA is encumbered by a patent
(http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=4839853) in the US,
and it is unlikely to be included in projects like Lucene or
Elasticsearch.

Jörg


Hi Paul,

Thank you again for the clear answer.
I know that I've stumbled upon a very dark (or hard) side of ES, but that
makes things more fun.
I took a look at the lucene-lda project (https://github.com/stepthom/lucene-lda),
which is very interesting but unfortunately does not implement online
indexing of documents; I mean that it expects documents to be already
indexed with LDA. To be honest, I'm not sure it uses Mallet at all, since I
don't see any dependency on that library.
As you said, similarity is a good place to start. My idea (the only one I
have at the moment) is to have an LDA indexer in the crawler, which extracts
topics and stores them in an ES document field, and to use that field for
the usual boolean queries. But that would not leverage all the power of LDA.
About this design, I have some concerns:
1) Given that the crawler is a river, the LDA dataset grows as the crawler
collects new documents, so old documents would have to be reindexed (and I
don't know how to do that).
2) All the LDA logic (and storage) is practically outside ES, and I would
like to find a way to integrate them better.
3) My design does not extract topics from queries, but just uses boolean
search (as a first step that is acceptable, but I would like to elaborate
on this concept further).
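
As a rough sketch of the design above, the crawler-side step could look
something like this in Python. Everything here is an assumption made for
illustration: the field names `body` and `topics`, the `topic_N` labels, the
0.2 threshold, and the idea that the topic mixture comes from some external
LDA library. It also assumes the `topics` field is mapped so that its values
are indexed as exact, unanalysed strings; otherwise the term query may not
match.

```python
import json


def dominant_topics(mixture, threshold=0.2):
    """Return labels of topics whose weight reaches the threshold, strongest first."""
    kept = [(weight, name) for name, weight in mixture.items() if weight >= threshold]
    return [name for weight, name in sorted(kept, reverse=True)]


def to_es_document(text, mixture, threshold=0.2):
    """Build the JSON body to index: the text plus a field listing its main topics."""
    return {"body": text, "topics": dominant_topics(mixture, threshold)}


def topic_filter_query(required_topics):
    """Build a bool query that requires every given topic label to be present."""
    return {
        "query": {
            "bool": {"must": [{"term": {"topics": t}} for t in required_topics]}
        }
    }


# Hypothetical topic mixture, as an external LDA model might produce for one page.
mixture = {"topic_0": 0.65, "topic_1": 0.30, "topic_2": 0.05}

doc = to_es_document("some crawled page text", mixture)
query = topic_filter_query(["topic_0"])

print(json.dumps(doc))
print(json.dumps(query))
```

This only covers storing and filtering on topic labels. It does not address
concern 1: when the model is retrained, topic labels can change meaning, so
the stored `topics` values of old documents would go stale.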

Am I headed in the right direction?
M@rco


Hi, Marco --

Advanced topics are an area where you need to be self-sufficient to pursue
them, and my suggestion is that you roll up your sleeves and start with
some of the Elasticsearch and LDA tutorials. Beyond that, I can't be of
much help.


prb@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
