I've been mulling over ES' docs in order to determine if it'd fit my
document clustering needs. It doesn't particularly look like it, but I
just wanted to throw out the question in case I'm missing something.
I've got a large number of documents, many of which are quite similar,
and I'd like to come up with a list of documents that are effectively
"unique". It looks like I could possibly do this with MLT (More Like This)
queries, but I'm not sure if I'd be trying to force a square peg into a
round hole.
For clustering of search results, see the Carrot2 project. You may also
want to look at how Carrot2 is integrated with Solr.
For offline, batch document clustering, see Apache Mahout.
If you have documents that are very, very similar, off by only a few
characters or words, and you want to dedupe them, have a look at how
it's done in Solr: Deduplication - Solr - Apache Software Foundation
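To make the dedupe idea concrete, here is a minimal Python sketch of one common approach: compare documents by their overlapping word shingles (Jaccard similarity) and greedily keep only those that are not too close to one already kept. This is not how Solr or ES does it; the function names, the shingle size `k` and the `threshold` value are all assumptions picked for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def unique_documents(docs, threshold=0.8, k=3):
    """Greedy pass: keep a document only if its similarity to every
    already-kept document is below the threshold.
    Returns the indices of the kept documents."""
    kept = []  # list of (index, shingle set)
    for i, text in enumerate(docs):
        s = shingles(text, k)
        if all(jaccard(s, t) < threshold for _, t in kept):
            kept.append((i, s))
    return [i for i, _ in kept]


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",
    "completely different text about search engines and clustering",
]
print(unique_documents(docs))
```

Note that this compares every document against every kept document, so it is only practical for small collections. For a large corpus you would probably want a signature-based scheme (as in the Solr link above) or something like MinHash so that you don't have to do pairwise comparisons.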