Document Clustering


(Michael Shapiro) #1

Hi Folks,

I've been mulling over ES' docs in order to determine if it'd fit my
document clustering needs. It doesn't particularly look like it, but I
just wanted to throw out the question in case I'm missing something.

I've got a large number of documents, many of them quite similar,
and I'd like to come up with a list of documents that are effectively
"unique". It looks like I could possibly do this with MLT (more like
this) queries, but I'm not sure if I'd be trying to stuff a round peg
into a square hole.

Any thoughts would be greatly appreciated!

--Mike


(Otis Gospodnetić) #2

Hi Mike,

For clustering of search results, see the Carrot2 project. You may also
want to see how Carrot2 is integrated with Solr.
For off-line, batch document clustering, see Apache Mahout.

If you have documents that are very, very similar, off by a few
characters or words, and want to dedupe them, you could have a look at
how it's done in Solr: http://wiki.apache.org/solr/Deduplication
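
To make the dedupe idea concrete, here is a rough sketch of near-duplicate
filtering using word shingles and Jaccard similarity. This is not how Solr
implements deduplication; it is a swapped-in, generic technique, and the
function names, the shingle size k and the 0.8 threshold are all assumptions
picked for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def unique_documents(docs, threshold=0.8, k=3):
    """Greedy filter: keep a document only if its similarity to every
    already-kept document is below the (assumed) threshold."""
    kept = []  # list of (doc_id, shingle_set)
    for doc_id, text in docs.items():
        sig = shingles(text, k)
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc_id, sig))
    return [doc_id for doc_id, _ in kept]


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "The quick brown fox jumps over the lazy dog!",
        "c": "completely different text about search engines",
    }
    print(unique_documents(docs))
```

This compares every document against every kept document, so it only suits
small collections. For a large corpus you would likely want a hashed
signature or a batch job instead of pairwise comparison.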

Otis




(Karussell) #3

Yes, MLT or "fuzzy like this" could be an option (though I'm using a
customized one**). Also have a look at the projects Otis mentioned.
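
For reference, a more-like-this request body might look roughly like the
sketch below. The helper name, the field names and the default values are
made up for illustration. The "like" parameter is what recent Elasticsearch
releases accept; older releases used "like_text", so check the docs for your
version.

```python
import json


def build_mlt_query(text, fields, min_term_freq=1, max_query_terms=25):
    """Build a more_like_this query body (hypothetical helper).

    `fields` are the index fields to compare against; the names used in
    the example below are assumptions, not fields from the thread.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": text,
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


if __name__ == "__main__":
    body = build_mlt_query("text of the document to compare", ["title", "body"])
    # The JSON below would be sent as the body of a search request.
    print(json.dumps(body, indent=2))
```

Running one such query per document and dropping the high-scoring hits is
one way to approximate a "unique" list, though where to put the score cutoff
is a judgment call.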

Peter.

**


