Hello,
Is there any aggregation that groups documents by the closeness of their dense vectors?
My documents have the following structure:
{
"label": "value",
"vector": [1, 0, 3, 0, 0, 0, 18, 0, 0, 0, ...] # N dimensions
}
What I want is to find clusters of documents whose vectors are "close" in an N-dimensional space.
I've searched the forum and the documentation and found the "Variable Width Histogram Aggregation" (https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-aggregations-bucket-variablewidthhistogram-aggregation.html), which looks like it does something similar to what I want, but it is not available in versions 7.7-7.8.
I should also point out that I structured the information this way because I thought it would make clustering easier from an ML perspective, but I could use any other data schema. My final goal is to group IP addresses into clusters according to the attack they are performing.
What I actually have is a set of events from my firewalls, each with the origin IP address that performed the attack and the kind of alert it triggered. What I did was group all the events triggered by each IP address into a single document (using a transform) and count the number of events that IP address generated per type of alert, summarizing the counts in a dense vector.
So, each dimension of the final vector is the number of events of one particular alert type that one IP address generated. The same position in different vectors always represents the same type of alert.
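To make the vector layout concrete, here is a minimal Python sketch of the same idea outside Elasticsearch. The alert type names, the sample events, and the `build_vectors` helper are all hypothetical and only serve as an illustration; the real documents are produced by the transform.

```python
from collections import Counter, defaultdict

# Hypothetical alert types; the position of each type fixes its vector dimension.
ALERT_TYPES = ["port_scan", "sql_injection", "brute_force"]

def build_vectors(events):
    """Turn (source_ip, alert_type) events into one count vector per IP."""
    counts = defaultdict(Counter)
    for ip, alert_type in events:
        counts[ip][alert_type] += 1
    # A Counter returns 0 for alert types an IP never triggered.
    return {ip: [c[t] for t in ALERT_TYPES] for ip, c in counts.items()}

# Made-up sample events, for illustration only.
events = [
    ("10.0.0.1", "port_scan"),
    ("10.0.0.1", "port_scan"),
    ("10.0.0.1", "brute_force"),
    ("10.0.0.2", "sql_injection"),
]
vectors = build_vectors(events)
```

Because every vector is built from the same ordered list of alert types, position i means the same thing in every document.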
Then I applied clustering algorithms, such as DBSCAN, to these vectors outside Elasticsearch, and the result was groups of IP addresses that were performing the same kinds of attacks. My next step would be to do exactly the same, but inside Elasticsearch.
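For reference, the clustering step can be sketched as follows. This is a simplified, pure-Python version of DBSCAN with Euclidean distance, not the exact code or library that was used; the sample vectors and the `eps` and `min_samples` values are assumptions chosen only to make the example small.

```python
import math

def dbscan(points, eps, min_samples):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_samples:
                queue.extend(nj)  # j is a core point, so keep expanding from it
    return labels

# Made-up count vectors per IP (hypothetical data, three alert types).
vectors = {
    "10.0.0.1": [20, 0, 1],
    "10.0.0.2": [18, 0, 0],
    "10.0.0.3": [0, 15, 0],
    "10.0.0.4": [1, 16, 0],
    "10.0.0.5": [0, 0, 40],
}
ips = list(vectors)
labels = dbscan([vectors[ip] for ip in ips], eps=3.0, min_samples=2)
clusters = dict(zip(ips, labels))
```

With this sample data, the first two IPs fall into one cluster, the next two into another, and the last one is marked as noise (-1).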
Is it possible to do this inside Elasticsearch? Is there another (better) way to do it?
Thanks in advance!!