Hi Paul,
Thank you again for the clear answer.
I know that I've stumbled upon a very dark (or hard) side of ES, but that
makes things more fun.
I took a look at the lucene-lda project (https://github.com/stepthom/lucene-lda),
which is very interesting but unfortunately does not implement online
indexing of documents; it expects documents to be already indexed with LDA.
To be honest, I'm not sure it uses Mallet at all, since I don't see any
dependency on that library.
As you said, similarity is a good place to start. My idea (the only one I have
at the moment) is to have an LDA indexer in the crawler, which extracts
topics and stores them in an ES document field, and then to query that field
with the usual boolean queries. That would not leverage the full power of LDA,
though. I have some concerns about this design:
1) Given that the crawler is a river, the LDA dataset grows as the crawler
collects new documents, so old documents should be reindexed (and I don't
know how to do that).
2) All the LDA logic (and storage) is practically outside ES, and I would
like to find a way to integrate them better.
3) My design does not extract topics from queries, but just uses boolean
search (as a first step that is acceptable, but I would like to develop
this concept further).
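
The design described above (topics stored in a document field and queried with boolean clauses) could be sketched roughly as follows. This is only an illustration under assumptions: the field names "body", "topics" and "lda_model_version", the 0.2 threshold, and the per-document topic weights are all made up, and the "topics" field is assumed to be indexed without analysis so that term clauses match it exactly. The LDA inference itself is left out; the sketch only shows the shape of the indexed document and of the query.

```python
# Illustrative sketch only: field names and the threshold are assumptions,
# not part of any Elasticsearch or LDA library API.

def topic_labels(distribution, threshold=0.2):
    """Keep the labels of topics whose weight in the document reaches threshold."""
    return ["topic_%d" % topic_id
            for topic_id, weight in sorted(distribution.items())
            if weight >= threshold]

def build_document(text, distribution, model_version):
    """Build the JSON body the crawler would send to the index."""
    return {
        "body": text,
        "topics": topic_labels(distribution),
        # Recording which model produced the topics makes it possible to
        # find documents tagged by an older model and reindex them later.
        "lda_model_version": model_version,
    }

def build_topic_query(required_topics):
    """Boolean query: every listed topic must be present in the document."""
    return {
        "query": {
            "bool": {
                "must": [{"term": {"topics": t}} for t in required_topics]
            }
        }
    }

doc = build_document(
    "river crawler text ...",
    {0: 0.55, 1: 0.05, 2: 0.40},  # hypothetical per-document topic weights
    model_version=3,
)
query = build_topic_query(["topic_0", "topic_2"])
```

Storing a model version next to the topics does not solve the reindexing concern, but it at least makes stale documents easy to select with an ordinary query.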
Am I heading in the right direction?
M@rco
On Wednesday, July 31, 2013 8:45:04 AM UTC+2, Paul Brown wrote:
Hi, Marco --
Maybe I shouldn't be so quick to be discouraging, but you're headed off
into the deepest corner of the deep end of the pool for your first swim
with Elasticsearch. That said...
Elasticsearch is a (very nice) wrapper for Lucene. Lucene itself is an
information retrieval system that in its default configuration combines
boolean searches (for matching) with the vector space model (for scoring).
The javadocs for the Similarity and TFIDFSimilarity classes are a good place
to start reading, and you can compare/contrast the Elasticsearch docs on
similarity (http://www.elasticsearch.org/guide/reference/index-modules/similarity/),
which is how the Lucene similarity implementations are manifest within
Elasticsearch.
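
To make the "boolean for matching, vector space for scoring" split concrete, here is a toy sketch in plain Python. It is not Lucene's implementation: the weighting is a simplified textbook-style TF-IDF with cosine similarity, the three documents are invented, and the names (docs, search, and so on) are hypothetical. Lucene's TFIDFSimilarity adds further factors that are omitted here.

```python
import math
from collections import Counter

# Tiny invented corpus, tokenised by whitespace.
docs = {
    "d1": "lucene scores documents with term weights",
    "d2": "topic models describe documents with latent topics",
    "d3": "elasticsearch wraps lucene",
}
tokens = {doc_id: text.split() for doc_id, text in docs.items()}

def idf(term):
    """Rarer terms get a larger weight (simplified smoothing)."""
    doc_freq = sum(1 for words in tokens.values() if term in words)
    return 1.0 + math.log(len(tokens) / (1.0 + doc_freq))

def tfidf_vector(words):
    """Map each term to sqrt(term frequency) * idf."""
    counts = Counter(words)
    return {term: math.sqrt(freq) * idf(term) for term, freq in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query):
    query_terms = query.split()
    query_vec = tfidf_vector(query_terms)
    # Boolean step (OR semantics): keep documents containing any query term.
    matches = [d for d, words in tokens.items() if set(query_terms) & set(words)]
    # Scoring step: rank only the matching documents by cosine similarity.
    scored = [(d, cosine(query_vec, tfidf_vector(tokens[d]))) for d in matches]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

results = search("lucene documents")
```

The point of the sketch is that both steps work purely on terms: a document that is about the same subject but shares no term with the query never reaches the scoring step.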
LDA (to pick one of your items) is a hierarchical Bayesian model for a set
of documents based on latent distributions ("topics") over a vocabulary.
Lucene operates on terms; LDA operates on topics. And terms != topics.
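
As a rough illustration of why terms != topics, the generative story behind LDA can be sketched in a few lines of plain Python. This is a toy under assumptions: the vocabulary, the two fixed topic distributions, and the alpha value are invented, and in the full model the topics themselves are also drawn from a Dirichlet prior rather than fixed by hand. Fitting a real model means inverting this process, which is what libraries such as Gensim or Mallet do.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probabilities, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def generate_document(topics, vocabulary, length, alpha, rng):
    """Generate one document: topic mixture first, then one topic per word."""
    # Per-document mixture over topics (the latent part of the model).
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(length):
        z = sample_index(theta, rng)      # pick a topic for this position
        w = sample_index(topics[z], rng)  # pick a word from that topic
        words.append(vocabulary[w])
    return theta, words

# Hypothetical vocabulary and two hand-written topics over it.
vocabulary = ["index", "query", "shard", "prior", "posterior", "sample"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "search" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "statistics" topic
]
rng = random.Random(0)
theta, words = generate_document(topics, vocabulary, length=8, alpha=0.5, rng=rng)
```

An inverted index only ever sees the words list; theta, the quantity LDA cares about, is never observed and has to be inferred.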
You might find this Github project informative:
https://github.com/stepthom/lucene-lda (Using latent Dirichlet allocation (LDA) in Apache Lucene)
Best.
-- Paul
—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
On Tue, Jul 30, 2013 at 2:27 AM, Marco Fago <fago....@gmail.com> wrote:
Hi Paul,
Thank you for the answer and the links.
Could you please explain why it is unrelated to Elasticsearch?
M@rco
On Tuesday, July 30, 2013 10:15:41 AM UTC+2, Paul Brown wrote:
Hi, Marco --
You could use the information in the underlying Lucene indexes from
Elasticsearch to feed LSI or LDA, but implementing those techniques is
unrelated to Elasticsearch's core use case. For more background, you could
try Gensim (http://radimrehurek.com/gensim/), Vowpal Wabbit
(https://github.com/JohnLangford/vowpal_wabbit/wiki/Latent-Dirichlet-Allocation),
or Mallet (http://mallet.cs.umass.edu/).
—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
On Tue, Jul 30, 2013 at 12:49 AM, Marco Fago fago....@gmail.com wrote:
Hi All,
I'm new to Elasticsearch and to some indexing techniques, but I'm very
interested in LSA, pLSA and LDA.
Basically I have no idea how to start implementing one of them in
Elasticsearch or what I should look at (source code, documentation,
plugins, whatever).
Can someone point me to a good approach to implement those algorithms?
Thank you in advance.
M@rco
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.