Clustering algorithms in Elasticsearch

Hi everyone! Last month, I worked with Elasticsearch to create relevant dashboards, and now I'm working with the "scikit-learn" library in Python to cluster my dataset (using k-means).
I wanted to know whether it is possible to use Elasticsearch to implement a machine learning clustering algorithm.
Thank you in advance :smiley:

I wanted to know whether it is possible to use Elasticsearch to implement a machine learning clustering algorithm.

The short answer is not easily.

Just to expand on this, it is possible to write something like streaming k-means as a scripted metric aggregation in Painless. I'm working on a set of data mining examples using the scripted metric aggregation as a side project, and this is one of them, but it isn't quite ready to share yet.
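That example isn't available to share, but just to illustrate the mechanism: a scripted metric aggregation is defined by init, map, combine, and reduce scripts, and the k-means assignment step can be expressed in the map script. A very rough sketch (not the work in progress mentioned above; the index, field names, and fixed centroids are made up) that simply counts documents per nearest centroid:

```python
# Rough sketch only: per-centroid document counts via a scripted_metric aggregation.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

agg = {
    "scripted_metric": {
        "params": {"centroids": [[0.0, 0.0], [10.0, 10.0]]},  # fixed, made-up centroids
        "init_script": "state.counts = [0L, 0L];",
        "map_script": """
            double x = doc['feature_1'].value;
            double y = doc['feature_2'].value;
            int best = 0; double bestDist = Double.MAX_VALUE;
            for (int k = 0; k < params.centroids.size(); k++) {
                double dx = x - params.centroids[k][0];
                double dy = y - params.centroids[k][1];
                double d = dx * dx + dy * dy;
                if (d < bestDist) { bestDist = d; best = k; }
            }
            state.counts[best] = state.counts[best] + 1;
        """,
        "combine_script": "return state.counts;",
        "reduce_script": """
            def totals = [0L, 0L];
            for (def c : states) {
                for (int k = 0; k < c.size(); k++) {
                    totals[k] = totals[k] + c[k];
                }
            }
            return totals;
        """,
    }
}

resp = es.search(index="users", body={"size": 0, "aggs": {"kmeans_assign": agg}})
print(resp["aggregations"]["kmeans_assign"]["value"])
```

A full streaming k-means would also need to accumulate per-cluster sums and update the centroids between passes; this sketch only shows the assignment side.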

It is also worth mentioning that there are some shrink-wrapped capabilities in the stack which might be suitable or useful depending on the exact problem you are trying to solve. The significant terms aggregation allows one to identify connections between documents based on co-occurrence of terms, which can be used in conjunction with spectral clustering, or just to find connected components as in the graph plugin. Also, we do have a log message categorisation capability available in the ML plugin and are exploring providing similar functionality as an aggregation; it clusters similar log messages based on an edit distance measure.
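For illustration, a significant terms request via the Python client might look roughly like this (the index, query, and field names are made up):

```python
# Hypothetical example: find terms that are unusually frequent in the matching
# documents compared to the rest of the index.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="my-index",                                        # placeholder index
    body={
        "query": {"match": {"user_group": "professional"}},  # placeholder foreground query
        "size": 0,
        "aggs": {
            "related_terms": {
                "significant_terms": {"field": "tags.keyword"}  # placeholder field
            }
        },
    },
)
for bucket in resp["aggregations"]["related_terms"]["buckets"]:
    print(bucket["key"], bucket["score"])
```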

How would you want to consume the clusters? For example, would you just need the cluster centres, would you want to annotate documents with cluster ids in an Elasticsearch index, or something else? This would be useful to know if we were to think about providing any capabilities in this area. It would also be useful to understand a bit about the nature of the problem you are trying to solve with clustering.


Here is the problem I'm trying to solve with clustering:

I have a sample of data on users of an automotive diagnostic tool, and I am trying to classify these users into two categories (beginner users vs. professional users) => 2 clusters.
In a CSV file, I have gathered all the variables related to a user's performance (3 quantitative variables and 1 qualitative variable).
The idea is to segment these users into 2 clusters using an unsupervised learning algorithm. The final goal would be to be able to assign a cluster (among the 2 clusters) to each user. I have already done all the work in Python (using PCA, then k-means, and finally the elbow method).
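A minimal sketch of that kind of pipeline (encode, scale, PCA, k-means, elbow method), assuming a hypothetical CSV file name and placeholder columns:

```python
# Rough sketch only; "users.csv" and its columns are placeholders.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("users.csv")                              # hypothetical file
X = StandardScaler().fit_transform(pd.get_dummies(df))     # encode categoricals, then scale
X_pca = PCA(n_components=2).fit_transform(X)               # reduce dimensionality

# Elbow method: inspect inertia (within-cluster sum of squares) over a range of k
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca).inertia_
            for k in range(1, 8)}

# Final model with the chosen k = 2
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pca)
```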

And from this data:

[screenshot of the sample data]

I got something like this:

[screenshot of the resulting clusters]

And I wanted to know whether there is already a clustering algorithm implemented in Elasticsearch, or whether it is simply much easier to work with the "scikit-learn" library rather than writing one in Painless, as you said.

OK, so porting all this functionality to the stack probably wouldn't happen, although writing a PCA agg has been mooted. However, just considering the setup you describe, I have a couple of observations which suggest to me that vanilla k-means would work:

  1. It doesn't seem like PCA would be needed for such low-dimensional data.
  2. If you know you want two clusters, looking for an elbow in the residual variance isn't needed. (But maybe you are also interested in how well two clusters model the data.)

One complication is that your variables also look like they are on different scales; for example, a typical temps moyen >> a typical Nbre voitures. You would have to normalise the variables in some way before clustering to get meaningful results. This would have to be a separate step if one were to do it in Elasticsearch. The matrix stats agg would be useful for this, since you would normally proceed by dividing each variable by its standard deviation.
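As a rough sketch of what that separate Elasticsearch-side step could look like (index and field names are guesses based on this thread), the matrix stats aggregation returns per-field variances that can be used for scaling:

```python
# Rough sketch only: fetch per-field variance with matrix_stats, then divide each
# variable by its standard deviation before clustering.
import math
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="users",  # placeholder index
    body={
        "size": 0,
        "aggs": {
            "stats": {
                "matrix_stats": {"fields": ["temps_moyen", "nbre_voitures"]}  # guessed fields
            }
        },
    },
)

scales = {f["name"]: math.sqrt(f["variance"]) for f in resp["aggregations"]["stats"]["fields"]}
print(scales)  # divide each raw value by its field's standard deviation
```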

I would also add that, on the face of it, this seems more like a classification problem to me, for which we do have support for training a model in the stack, but I realise you probably want something unsupervised.

I wonder what the main benefits of having this in the stack would be for you. The main reason to have a k-means agg is data scale, but it sounds like doing everything in memory in a Python process is enough for your use case. I guess the other reason would be needing to read and write the results from and to Elasticsearch. On this front, you should check out both the Elasticsearch Python client and also eland, which provides some pandas-DataFrame-like functionality on top of Elasticsearch indices.
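A rough sketch of that route (the host and index name are placeholders):

```python
# Read an Elasticsearch index into pandas via eland for use with scikit-learn.
import eland as ed
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Lazily evaluated, DataFrame-like view over the index
ed_df = ed.DataFrame(es, es_index_pattern="users")

# Pull it into an in-memory pandas DataFrame for scikit-learn
pd_df = ed.eland_to_pandas(ed_df)
print(pd_df.head())
```

eland also has pandas_to_eland for writing a pandas DataFrame back into an index, which would cover the "write the results back" direction.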


Thank you very much for your remarks! Regarding the normalization of my data, I've already done it before running k-means.

I have some questions on your observations:

  • You said that my data is low-dimensional. What do you consider a small dimension? At first, I had 4 variables, but after transforming the categorical variable (Anciennete) into numeric variables (0 and 1) to be able to run k-means, the number of variables increased => the dimension increased. That's why I used PCA (see the encoding sketch after this list).

  • On the other hand, I didn't understand how classification would work on my data? In fact, I chose an unsupervised algorithm because I have unlabeled data.
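A minimal illustration of the encoding step mentioned in the first point (the column name and values are placeholders):

```python
# Dummy-encoding a categorical column: each category becomes its own 0/1 column,
# which is why the number of variables (the dimension) grows.
import pandas as pd

df = pd.DataFrame({"anciennete": ["junior", "senior", "junior"]})
encoded = pd.get_dummies(df, columns=["anciennete"])
print(encoded)
```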

after transforming the categorical variable (Anciennete) into numeric variables (0 and 1) to be able to run k-means, the number of variables increased => the dimension increased. That's why I used PCA.

Right, that makes sense.

I didn't understand how classification would work on my data?

Well, you would have to generate labels ;). This would be the challenging part. I guess if there were an element of human judgement which disagreed with the results of k-means, such an approach could learn this. For example, you could run clustering and use the clusters it found to generate training data, followed by some manual fine-tuning of these raw labels. K-means would have a linear decision boundary between the (two) clusters, so even a linear model would be able to learn this, and the gradient boosted trees we have in the stack could too. Of course, this would only be useful if you somehow wanted to deploy a model (for example, to assign new data to clusters as it comes in via an ingest pipeline), so it may not be useful to you anyway.
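A hedged sketch of that idea, using scikit-learn's gradient boosting as a stand-in for the boosted trees in the stack (the data here is synthetic and purely illustrative):

```python
# Use k-means cluster assignments as raw labels (optionally hand-corrected),
# then train a supervised classifier that can be applied to new users.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # placeholder feature matrix (already scaled)

raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# ... manual fine-tuning of `raw_labels` would happen here ...

X_train, X_test, y_train, y_test = train_test_split(X, raw_labels, random_state=0)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```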

I'll post here when I have a k-means agg available that I can point you at, but, as before, maybe just using eland and sklearn works well for your use case.


Yes, I get your point! Thank you a lot. I guess your observations will help me optimize my model.
