In Elasticsearch, is it possible to cluster documents that share the most similar texts, without giving an initial query to compare to?

In Elasticsearch, is it possible to group documents that share the most similar texts, without giving an initial query to compare to?

I know it is possible to query for "more like this document", but is it possible to cluster the documents within an index according to the values of a field?

For instance:

document 1: The quick brown fox jumps over the lazy dog

document 2: Barcelona is a great city

document 3: The fast orange fox jumps over the lazy dog

document 4: Madrid is a great city

document 5: I do not like to eat fish

Now, I would like to perform some kind of aggregation that, without being given a search query, can group them like this:

Group 1: document 1 and document 3

Group 2: document 2 and document 4

Group 3: document 5

I would really appreciate any clue!

There is currently no aggregation that performs clustering. There is an issue for adding k-means clustering as an aggregation (https://github.com/elastic/elasticsearch/issues/5512). I played around with a prototype for this a while ago, but the aggregations framework itself would need some changes to support this kind of aggregation, and that work is yet to be done.
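
In the meantime, one possible workaround is to do the grouping on the client side. The following is only a minimal sketch and not an Elasticsearch feature: it assumes the field values have already been fetched into a Python list, and the helper names (`tokenize`, `jaccard`, `group_similar`) and the 0.5 threshold are hypothetical choices made for illustration. It links documents whose word sets have a Jaccard similarity at or above the threshold and returns the connected groups; note that this is a simple threshold-based grouping, not k-means.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_similar(texts, threshold=0.5):
    """Group texts by linking every pair whose similarity >= threshold.

    Returns a list of groups, each a list of indices into `texts`.
    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    tokens = [tokenize(t) for t in texts]
    parent = list(range(len(texts)))  # union-find forest

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(texts)), 2):
        if jaccard(tokens[i], tokens[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as the root of the merged group.
                parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return [groups[root] for root in sorted(groups)]


# The five example documents from the question (indices 0 to 4).
docs = [
    "The quick brown fox jumps over the lazy dog",
    "Barcelona is a great city",
    "The fast orange fox jumps over the lazy dog",
    "Madrid is a great city",
    "I do not like to eat fish",
]
groups = group_similar(docs)
```

With the example documents from the question, `groups` pairs documents 1 and 3, pairs documents 2 and 4, and leaves document 5 on its own (as zero-based indices). Comparing every pair is quadratic in the number of documents, so this would only be practical for small result sets.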

Ohhh, good and sad to know then, thanks a lot Colin.
