Clustering data on Elasticsearch index

nfantone · June 20, 2016, 4:33am

Hi, everyone!

I created an account here because I asked a question over at SO and received absolutely no love (I even got the Tumbleweed badge for it - no kidding).

Here it is:
http://stackoverflow.com/questions/37734131/clustering-data-in-elasticsearch

Could someone give me a hand on this? It'd be much appreciated.

EDIT: As requested, I'm inlining my original Stack Overflow question here. Its content follows:

I have a rather large set of customer purchases stored in an Elasticsearch index. What I'd like to do is group customers on the set and generate a new index from that data in such a way that'd allow me to:

1. Differentiate unique customers.
2. Have aggregated information on each entry (such as sums and averages of a number of other fields).

My problem comes with the business definition I was given for "uniqueness" of customers. Two customers are considered to be the same if at least 75% of their properties match (like "country", "language", "email" and so on). Properties are dynamically added during user profile creation, and they might change, be added or removed in the future.

This seems closely related to how a terms filter with a minimum_should_match of 75% resolves things. So, my question is: is there a way of clustering data in Elasticsearch 2.0+ that would fit my scenario? Ideally, it would behave like a multi-bucket aggregation that would group documents if more than 3/4 of their attributes' values match each other.
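To make the 75% rule concrete, here is a minimal sketch of the per-customer check expressed as a `bool` query with `minimum_should_match`. The `similarity_query` helper and the sample property values are made up for illustration, and the sketch assumes every property is indexed as a `not_analyzed` field so that `term` clauses compare exact values. It only finds documents similar to one given customer; it is not the clustering being asked about.

```python
import json


def similarity_query(properties, threshold="75%"):
    """Build a bool query that matches documents sharing at least
    `threshold` of the given property values (hypothetical helper)."""
    # One exact-match clause per property; sorted for a stable clause order.
    should = [{"term": {field: value}} for field, value in sorted(properties.items())]
    return {
        "query": {
            "bool": {
                "should": should,
                # With 4 clauses, "75%" means at least 3 must match.
                "minimum_should_match": threshold,
            }
        }
    }


# Illustrative customer; the field names come from the question, the values are invented.
customer = {
    "country": "AR",
    "language": "es",
    "email": "jane@example.com",
    "city": "Rosario",
}
body = similarity_query(customer)
print(json.dumps(body, indent=2))
```

The resulting body could be sent to the `_search` endpoint of the purchases index. Because the clause list is built from whatever properties the customer has, it copes with properties being added or removed over time.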

EDIT: I'm not looking for manual solutions like iterating over each document and querying the index to retrieve similar results.

warkolm · June 20, 2016, 4:34am

It'd probably get more interest if you reposted the question too

nfantone · June 20, 2016, 4:56am

Well, the whole point was to not do that and avoid polluting the Internet, since it's just one click away. But if you think it'll help, I'll edit my original post to include it.

warkolm · June 20, 2016, 4:57am

I think you underestimate the laziness of the internet.

mainec · June 20, 2016, 7:37am

I think a More Like This query might be able to help you:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
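As a rough sketch of what such a request body could look like when the seed is a document that is already indexed: the `mlt_query` helper, the index name `purchases`, the type `customer` and the id `42` are all assumptions for illustration. The `min_term_freq` and `min_doc_freq` settings are lowered to 1 on the assumption that property values are short and appear only once per document.

```python
import json


def mlt_query(index, doc_type, doc_id, fields, minimum_should_match="75%"):
    """Build a more_like_this request body seeded by an indexed document
    (hypothetical helper; names passed in are placeholders)."""
    return {
        "query": {
            "more_like_this": {
                # Fields to compare; they have to be listed explicitly here.
                "fields": fields,
                # Seed document, referenced by index, type and id.
                "like": [{"_index": index, "_type": doc_type, "_id": doc_id}],
                # Short property values rarely repeat, so keep the thresholds low.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "minimum_should_match": minimum_should_match,
            }
        }
    }


q = mlt_query("purchases", "customer", "42", ["country", "language", "email"])
print(json.dumps(q, indent=2))
```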

There is a proposal for a fingerprinting ingest processor here: https://github.com/elastic/elasticsearch/issues/16938 that you might find interesting as well.

Hope this helps,
Isabel

nfantone · June 20, 2016, 4:41pm

This could very well be a viable solution.

The only drawback I can think of is that this would require knowing the actual doc ids and matching fields beforehand (unless the fields option accepts wildcards - something not mentioned in the docs). Am I right?

What I'd really like is to find clusters of documents matching each other, not some seed document. From the MLT doc:

"Suppose we wanted to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query."

Then again, you can also think of this as a two-step process in which you first find a list of seed documents (the most representative one for each type), and then run an MLT query against that list.

The fingerprint processor looks promising. Already +1'd it.

jprante · June 20, 2016, 9:30pm

If you want to group documents by statistics of field similarity, you need to compute the distance between documents. Each doc field stands for a feature, so your problem is a feature space topology similarity computation.

If you want to compute all groups, you must measure the distance of each document to all other documents, by iterating over the docs.

The MLT query simplifies such an approach. It can be executed for each doc to build doc groups.
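A minimal in-memory sketch of that "compare every doc with every other doc" idea follows. Everything in it is an assumption for illustration: `match_ratio` is one possible reading of "75% of properties match" (equal values divided by the union of property names), and the pairwise loop stands in for what would be one MLT query per document against the index. Groups are built with a small union-find, so a doc joins a group if it is similar enough to any member.

```python
from itertools import combinations


def match_ratio(a, b):
    """Fraction of property names (union of both docs) whose values are equal."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return same / len(keys)


def group_docs(docs, threshold=0.75):
    """Link every pair of docs whose match_ratio reaches the threshold,
    then return the connected groups of doc ids."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent pointers to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # O(n^2) comparisons: fine for a sketch, expensive for a large index.
    for a, b in combinations(docs, 2):
        if match_ratio(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return sorted(sorted(g) for g in groups.values())


# Invented sample data: "a" and "b" share 3 of 4 properties, "c" shares none.
docs = {
    "a": {"country": "AR", "language": "es", "email": "x@example.com", "city": "Rosario"},
    "b": {"country": "AR", "language": "es", "email": "x@example.com", "city": "Cordoba"},
    "c": {"country": "DE", "language": "de", "email": "y@example.com", "city": "Berlin"},
}
print(group_docs(docs))
```

Note that linking through any similar member means groups can chain: two docs may end up together without being 75% similar to each other directly.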

The algorithm behind MLT is k-nearest-neighbor. See https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

Lucene has a built-in k-nearest-neighbor classifier (http://lucene.apache.org/core/5_5_0/classification/org/apache/lucene/classification/package-summary.html), but exposing that implementation in Elasticsearch is not trivial.

mainec · June 21, 2016, 9:23am

"Identifying and Filtering Near-Duplicate Documents" (SpringerLink): this is one of the algorithms that I believe would be made easier by having fingerprinting support.

Isabel

Mark_Harwood · June 21, 2016, 11:25am

If I understand the question correctly you are talking about "entity resolution" - identifying each real-world person from a set of documents that may have described the same person in different ways.

A key question is whether you are doing this to:

1. assist users (as in the system suggests "person A might be person B - please confirm"), or
2. automate some batch data de-duplication process with minimal human intervention.

Fuzzy matching rules like MLT are useful for scenario #1 (where the emphasis is on recall over precision), but scenario #2 requires more rigour and emphasises precision over recall. Each merge operation has to be something that can be trusted for the algorithm to iterate on without human intervention. A person cannot be allowed to over-link and become his brother, then their father, and so on, through a steady accumulation of weakly linked properties.

Large-scale iterative entity resolution can be achieved, but it uses non-fuzzy keys and lots of different ways of composing the keys. I demoed this 37 minutes into this presentation at Elastic{ON}: https://www.elastic.co/elasticon/conf/2016/sf/graph-capabilities-in-the-elastic-stack
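To give a feel for what "non-fuzzy keys composed in different ways" can mean, here is a small sketch. It is not the approach shown in the presentation, just an illustration under assumptions: `composite_keys` and `candidates_by_key` are hypothetical helpers, keys are built from every pair of normalised property values, and the sample records are invented. Records only become merge candidates when they share an exact composite key.

```python
from collections import defaultdict
from itertools import combinations


def composite_keys(props, key_size=2):
    """Yield exact keys built from every combination of `key_size` properties."""
    # Normalise values (trim, lowercase) so that matching stays exact but tolerant
    # of trivial formatting differences.
    items = sorted((k, str(v).strip().lower()) for k, v in props.items())
    for combo in combinations(items, key_size):
        yield "|".join(f"{k}={v}" for k, v in combo)


def candidates_by_key(docs, key_size=2):
    """Map each composite key to the doc ids carrying it, keeping shared keys only."""
    buckets = defaultdict(set)
    for doc_id, props in docs.items():
        for key in composite_keys(props, key_size):
            buckets[key].add(doc_id)
    return {k: sorted(ids) for k, ids in buckets.items() if len(ids) > 1}


# Invented records: same email and country, different language.
records = {
    "a": {"email": "X@example.com", "country": "AR", "language": "es"},
    "b": {"email": "x@example.com ", "country": "AR", "language": "pt"},
}
shared = candidates_by_key(records)
print(shared)
```

If the keys were indexed alongside each document in a `not_analyzed` field, a `terms` aggregation with `min_doc_count` set to 2 on that field is one way the shared keys might be found inside Elasticsearch.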
