Document Clustering

Michael_Shapiro · November 21, 2011, 10:14pm

Hi Folks,

I've been mulling over ES' docs in order to determine if it'd fit my
document clustering needs. It doesn't particularly look like it, but I
just wanted to throw out the question in case I'm missing something.

I've got a large number of documents, many of them quite similar,
and I'd like to come up with a list of documents that are effectively
"unique". It looks like I could possibly do this with MLT queries, but
I'm not sure if I'd be trying to stuff a round peg into a square hole.

Any thoughts would be greatly appreciated!

--Mike

otisg · November 22, 2011, 5:27am

Hi Mike,

For clustering of search results, see the Carrot2 project. You may also
want to see how Carrot2 is integrated with Solr.
For offline, batch document clustering, see Apache Mahout.

If you have documents that are very, very similar, off by a few
characters or words, and want to dedupe them, you could have a look at
how it's done in Solr: Deduplication - Solr - Apache Software Foundation
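
To make the signature idea concrete, here is a minimal sketch: normalize each document, hash the normalized text, and keep only the first document per hash. This is not Solr's implementation; the class name `SignatureDedup`, the normalization rule, and the choice of MD5 are assumptions made for illustration. An exact hash only catches documents that become identical after normalization, so documents that differ by whole words still need a fuzzy similarity measure.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Solr or Elasticsearch.
final class SignatureDedup {

    // Lowercase the text and collapse every run of non-letter, non-digit
    // characters into a single space, so punctuation and case are ignored.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                   .trim();
    }

    // Hex-encoded MD5 digest of the normalized text (assumed hash choice).
    static String signature(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keep the first document seen for each signature, preserving input order.
    static List<String> keepFirstPerSignature(List<String> docs) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String doc : docs) {
            if (seen.add(signature(doc))) {
                unique.add(doc);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Hello, World!", "hello world", "Something else");
        System.out.println(keepFirstPerSignature(docs));
    }
}
```

With the sample input in `main`, the first two strings normalize to the same text, so only the first of them is kept.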

Otis

Karussell1 · November 23, 2011, 8:02am

Yes, MLT or "fuzzy like this" queries could be an option (though I'm
using a customized one**). Also have a look at the projects Otis mentioned.

Peter.

** github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/es/ElasticTweetSearch.java#L1005

    // Excerpt as shown in the link preview; it begins mid-method and is cut off at the end.
            sb.append(separator);
            sb.append(tweet.getRetweetCount());
            sb.append(separator);
            sb.append(tweet.getText().replaceAll("\n", " "));
            sb.append("\n");
        }

        return sb.toString();
    }

    public Collection<JTweet> findDuplicates(Map<Long, JTweet> tweets) {
        final Set<JTweet> updatedTweets = new LinkedHashSet<JTweet>();
        TermCreateCommand termCommand = new TermCreateCommand();
        // Similarity threshold; the name suggests a Jaccard index cutoff.
        double JACC_BORDER = 0.7;
        for (JTweet currentTweet : tweets.values()) {
            // Retweets are skipped.
            if (currentTweet.isRetweet())
                continue;

            JetwickQuery reqBuilder = new SimilarTweetQuery(currentTweet, false).addLatestDateFilter(24);
            // Tweets with fewer than three terms are skipped.
            if (currentTweet.getTextTerms().size() < 3)
                continue;
            // ... (excerpt ends here)
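
The excerpt stops before `JACC_BORDER` is used, so how Jetwick applies it is not shown here. Assuming the name refers to a Jaccard similarity threshold, the general technique can be sketched as below: treat each document as a set of terms, compute |A ∩ B| / |A ∪ B|, and drop any document that reaches the threshold against one already kept. The class `JaccardDedup`, its tokenizer, and the greedy keep-first strategy are hypothetical and are not taken from Jetwick.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of Jaccard-based near-duplicate filtering.
final class JaccardDedup {

    // Same value as in the excerpt; its meaning there is an assumption.
    static final double JACC_BORDER = 0.7;

    // Split on non-letter, non-digit runs and collect the distinct lowercase terms.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard index: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Greedy pass: keep a document only if it is below the threshold
    // against every document kept so far. Quadratic in the worst case.
    static List<String> effectivelyUnique(List<String> docs) {
        List<String> kept = new ArrayList<>();
        List<Set<String>> keptTerms = new ArrayList<>();
        for (String doc : docs) {
            Set<String> current = terms(doc);
            boolean duplicate = false;
            for (Set<String> other : keptTerms) {
                if (jaccard(current, other) >= JACC_BORDER) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(doc);
                keptTerms.add(current);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the quick brown fox jumps over the lazy dog",
                "the quick brown fox jumps over the lazy cat",
                "elasticsearch is a search engine");
        System.out.println(effectivelyUnique(docs));
    }
}
```

Comparing every pair is too slow for a large corpus. One way around that is to fetch a small candidate set per document first (for example with an MLT-style query, as suggested above) and apply the Jaccard check only to those candidates.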
