I created an account here because, having asked a question over at SO, I received absolutely no love (I even got the Tumbleweed badge for it - no kidding).
Here it is:
Could someone give me a hand on this? It'd be much appreciated.
EDIT As requested, I'm inlining my original Stackoverflow question over here. Find its content below:
I have a rather large set of customer purchases stored in an Elasticsearch index. What I'd like to do is group customers on the set and generate a new index from that data in such a way that'd allow me to:
Differentiate unique customers.
Have aggregated information on each entry (such as sums and avg of a number of other fields).
My problem comes with the business definition I was given for "uniqueness" of customers. Two customers are considered to be the same if at least 75% of their properties match (like "country", "language", "email" and so on). Properties are dynamically added during user profile creation and they might change, be added or removed in the future.
This seems closely related to how a terms filter with a minimum_should_match of 75% resolves things. So, my question is: is there a way of clustering data in Elasticsearch 2.0+ that would fit my scenario? Ideally, it would behave like a multi-bucket aggregation that would group documents if more than 3/4 of their attributes' values match each other.
EDIT: I'm not looking for manual solutions like iterating over each document and querying the index to retrieve similar results.
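To make the uniqueness rule concrete, here is a minimal Python sketch of the 75% test on two customer records. It is only an illustration of the business rule, not a way to run it inside Elasticsearch. The function names and sample records are made up, and it assumes the percentage is taken over the union of both records' properties, which the definition above does not spell out.

```python
def match_ratio(a: dict, b: dict) -> float:
    """Fraction of properties (union of both records) whose values are equal."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    # A property only counts as matching if both records have it with the same value.
    same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return same / len(keys)


def same_customer(a: dict, b: dict, threshold: float = 0.75) -> bool:
    """True when at least `threshold` of the properties match."""
    return match_ratio(a, b) >= threshold


# Hypothetical sample records.
alice_1 = {"country": "ES", "language": "es", "email": "a@example.com", "city": "Madrid"}
alice_2 = {"country": "ES", "language": "es", "email": "a@example.com", "city": "Sevilla"}
bob = {"country": "FR", "language": "fr", "email": "b@example.com", "city": "Madrid"}

print(same_customer(alice_1, alice_2))  # 3 of 4 properties match
print(same_customer(alice_1, bob))      # 1 of 4 properties match
```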
It'd probably get more interest if you reposted the question here too.
Well, the whole point was to not do that and avoid polluting the Internet, since it's just one click away. But if you think it'll help, I'll edit my original post to include it.
I think you underestimate the laziness of the internet.
I think a More Like This (MLT) query might be able to help you:
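As a rough sketch of what such a request body could look like (built here as a Python dict and printed as JSON), assuming a seed document is already known: the index, type, id and field names are hypothetical, and `min_term_freq` / `min_doc_freq` are lowered to 1 on the assumption that the defaults would skip short, single-valued fields.

```python
import json


def mlt_query(index, doc_type, doc_id, fields, min_match="75%"):
    """Build a more_like_this query body that uses an indexed document as the seed."""
    return {
        "query": {
            "more_like_this": {
                "fields": fields,  # fields to compare (hypothetical names below)
                "like": [{"_index": index, "_type": doc_type, "_id": doc_id}],
                "min_term_freq": 1,  # assumption: each property value occurs once per doc
                "min_doc_freq": 1,   # assumption: rare values should still be considered
                "minimum_should_match": min_match,
            }
        }
    }


body = mlt_query("purchases", "customer", "42", ["country", "language", "email"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the `_search` endpoint of the index. Note that `minimum_should_match` applies to the terms MLT selects from the seed document, so whether it lines up exactly with "75% of the properties" depends on how the fields are analyzed.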
There is a proposal for a fingerprinting ingest processor here: https://github.com/elastic/elasticsearch/issues/16938 that you might find interesting as well.
Hope this helps,
This could very well be a viable solution.
The only drawback I can think of is that this would require knowing the actual doc ids and matching fields beforehand (unless the fields option accepts wildcards - something not mentioned in the docs). Am I right?
What I'd really like is to find clusters of documents matching each other, not some seed document. From the MLT doc:
Suppose we wanted to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query.
Then, you can also think of this as a two-step process in which you first find a list of every seed document (the most representative for each type), and then run an MLT query against that list.
The fingerprint processor looks promising. Already +1d it.
If you want to group documents by statistics of field similarity, you need to compute the distance between documents. Each doc field stands for a feature, so your problem is a feature space topology similarity computation.
If you want to compute all groups, you must measure the distance of each document to all other documents, by iterating over the docs.
The MLT query simplifies such an approach. It can be executed for each doc to build doc groups.
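A minimal sketch of that "compare every doc with every other doc, then build groups" idea in plain Python, using a union-find structure to merge similar pairs. The `similar` callback is a stand-in: in practice it could be replaced by the hits of an MLT query per document. All names and sample data are hypothetical.

```python
from itertools import combinations


def group_documents(docs, similar):
    """Group doc ids so that any two docs judged similar end up in the same group."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pairwise comparison: O(n^2) calls to `similar`.
    for a, b in combinations(docs, 2):
        if similar(docs[a], docs[b]):
            parent[find(a)] = find(b)

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), set()).add(doc_id)
    return sorted(groups.values(), key=lambda g: sorted(g))


def mostly_equal(a, b, threshold=0.75):
    """Stand-in similarity: share of properties with equal values."""
    keys = set(a) | set(b)
    same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return bool(keys) and same / len(keys) >= threshold


docs = {
    "d1": {"country": "ES", "language": "es", "email": "a@example.com", "city": "Madrid"},
    "d2": {"country": "ES", "language": "es", "email": "a@example.com", "city": "Sevilla"},
    "d3": {"country": "FR", "language": "fr", "email": "b@example.com", "city": "Paris"},
}
print(group_documents(docs, mostly_equal))
```

Keep in mind that the merge is transitive: if A is similar to B and B to C, all three land in one group even when A and C are not similar to each other.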
The algorithm behind MLT is k-nearest-neighbor. See https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
Lucene has a built-in k-nearest-neighbor classifier http://lucene.apache.org/core/5_5_0/classification/org/apache/lucene/classification/package-summary.html but exposing that implementation in Elasticsearch is not trivial.
http://link.springer.com/chapter/10.1007/3-540-45123-4_1 this is one of the algorithms that I believe would be made easier by having fingerprinting support.
If I understand the question correctly you are talking about "entity resolution" - identifying each real-world person from a set of documents that may have described the same person in different ways.
A key question is whether you are doing this to:
assist users (as in the system suggests "person A might be person B - please confirm") or
automate some batch data de-duplication process with minimal human intervention.
Fuzzy matching rules like MLT are useful for scenario #1 (emphasis being on recall vs precision) but scenario #2 requires more rigour and emphasises precision over recall. Each merge operation has to be something that can be trusted for the algorithm to iterate on without human intervention. A person cannot be allowed to over-link and become his brother then their father etc through a steady accumulation of weakly linked properties.
Large-scale iterative entity resolution can be achieved but uses non-fuzzy keys and lots of different ways of composing the keys. I demoed this 37 minutes into this presentation at elasticon: https://www.elastic.co/elasticon/conf/2016/sf/graph-capabilities-in-the-elastic-stack
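To illustrate what "non-fuzzy keys composed in lots of different ways" could mean for the 75% rule, here is a small Python sketch. It is my own illustration and not necessarily what the presentation shows: every combination of 3/4 of a record's properties becomes one exact key, so two records that agree on at least that many properties share at least one key. All names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import ceil


def composite_keys(record, ratio=0.75):
    """Return one exact-match key per combination of ceil(ratio * n) properties."""
    items = sorted(record.items())  # stable order so equal combinations give equal keys
    size = ceil(len(items) * ratio)
    return ["|".join(f"{k}={v}" for k, v in combo) for combo in combinations(items, size)]


def bucket_by_key(records):
    """Map each composite key to the record ids that produced it, keeping shared keys only."""
    buckets = defaultdict(set)
    for rec_id, record in records.items():
        for key in composite_keys(record):
            buckets[key].add(rec_id)
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


records = {
    "r1": {"country": "ES", "language": "es", "email": "a@example.com", "city": "Madrid"},
    "r2": {"country": "ES", "language": "es", "email": "a@example.com", "city": "Sevilla"},
    "r3": {"country": "FR", "language": "fr", "email": "b@example.com", "city": "Madrid"},
}
for key, ids in bucket_by_key(records).items():
    print(key, sorted(ids))
```

If keys like these were indexed as a not-analyzed field on each document, an ordinary terms aggregation could presumably do the bucketing. The number of combinations grows quickly with the number of properties, so this only looks practical for small property sets.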