Hey,
I'm making lots of great use of the dense_vector field type, but I've hit a bit of a wall now. Is there any way I can get the "average" from a vector field?
Context
My real use-case is a bit more complex, but imagine I have this mapping for emails with a spam flag:
PUT /emails
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "text_vector": {
        "type": "dense_vector",
        "dims": 300
      },
      "is_spam": {
        "type": "boolean"
      }
    }
  }
}
I've vectorized and ingested my data, so I now have 150,000 docs that each have 300-dimension vectors associated with them.
Now, I want to do a vector similarity search (using cosineSimilarity) to find documents similar to those I've flagged as spam, but which are not yet flagged themselves.
With a query vector in hand, the script_score vector functions make this easy, and I can quickly retrieve the most similar documents.
POST /emails/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "must_not": {
            "term": {
              "is_spam": true
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
        "params": {
          "query_vector": [...] // this is lengthy
        }
      }
    }
  }
}
But I need a way to build that query_vector in the first place. This works well if I have seed text or a seed document, because I can calculate the vector (or fetch it from the source) as a prerequisite, but when I want to cover a collection of documents, it gets more difficult.
As an initial stab, I want to try to find the "center" of the flagged documents. For that, I simply need the item-wise mean vector.
So if I have these documents:
1: { ..., "text_vector": [ 1, 2, 3 ], ... },
2: { ..., "text_vector": [ 2, 1, 3 ], ... },
3: { ..., "text_vector": [ 0, 5, 1 ], ... }
Then I can just sum up each array index, then divide by three:
sum[0] = 1 + 2 + 0 = 3
sum[1] = 2 + 1 + 5 = 8
sum[2] = 3 + 3 + 1 = 7
avg[0] = 3 / 3 = 1
avg[1] = 8 / 3 = 2.67
avg[2] = 7 / 3 = 2.33
And so, I pass "query_vector": [ 1, 2.67, 2.33 ] into the above query and retrieve the documents I'm looking for.
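For reference, that item-wise mean is trivial to sketch client-side in plain Python (no Elasticsearch involved — in practice the input would be the text_vector arrays read out of the document sources):

```python
def mean_vector(vectors):
    """Item-wise mean of equal-length vectors (the centroid)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dims = len(vectors[0])
    sums = [0.0] * dims
    for vec in vectors:
        for i, value in enumerate(vec):
            sums[i] += value
    return [s / len(vectors) for s in sums]

# The three example documents above:
query_vector = mean_vector([[1, 2, 3], [2, 1, 3], [0, 5, 1]])
# → [1.0, 2.666..., 2.333...]
```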
Current approach
Right now, I start with a paginated or scroll query to get back the document sources, then build up the sums and averages procedurally. This is fine for a few documents, but it becomes troublesome as the set grows. At the very least, there's a lot of network traffic required to read 300 dimensions out of every matching document.
Desired approach
I'd love to be able to use the avg aggregation to get these values handed to me. Obviously, a single aggregation isn't going to give me back an array result, but I would much prefer something like:
POST /emails/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": {
        "term": {
          "is_spam": true
        }
      }
    }
  },
  "aggs": {
    "text_vector_0": {
      "avg": {
        "field": "text_vector.0"
      }
    },
    "text_vector_1": {
      "avg": {
        "field": "text_vector.1"
      }
    },
    ...
    "text_vector_299": {
      "avg": {
        "field": "text_vector.299"
      }
    }
  }
}
But of course, as expected, those fields don't exist.
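(If something like this did work, the 300 per-dimension clauses would of course be generated rather than hand-written. A quick Python sketch — avg_aggs is just an illustrative helper name:)

```python
def avg_aggs(prefix, dims):
    """One avg aggregation per vector component, keyed prefix_0..prefix_{dims-1}."""
    return {
        f"{prefix}_{i}": {"avg": {"field": f"{prefix}.{i}"}}
        for i in range(dims)
    }

# Request body for the search above, with all 300 aggs generated:
body = {"size": 0, "aggs": avg_aggs("text_vector", 300)}
```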
Options
Read the values in a script
I had thought to use script aggregations, but any attempt to read elements from the dense_vector field returns the error:
unsupported_operation_exception: accessing a vector field's value through 'get' or 'value' is not supported
Use the dot product to retrieve values
I came up with another (I think, clever) idea: use the dotProduct function with identity-matrix slices. But I quickly found that dotProduct, cosineSimilarity, and the other vector functions are only available in the score context.
So now the best I've got is a multi-search that aggregates the score for each basis query vector individually:
POST /_msearch?filter_path=responses.aggregations.text_vector_*.value
{ "index": "emails" }
{ "size": 0, "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "dotProduct(params.query_vec, 'text_vector') + 1.0", "params": { "query_vec": [1, 0, 0] } } } }, "aggs": { "text_vector_0": { "avg": { "script": { "source": "return _score.floatValue() - 1;" } } } } }
{ "index": "emails" }
{ "size": 0, "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "dotProduct(params.query_vec, 'text_vector') + 1.0", "params": { "query_vec": [0, 1, 0] } } } }, "aggs": { "text_vector_1": { "avg": { "script": { "source": "return _score.floatValue() - 1;" } } } } }
{ "index": "emails" }
{ "size": 0, "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "dotProduct(params.query_vec, 'text_vector') + 1.0", "params": { "query_vec": [0, 0, 1] } } } }, "aggs": { "text_vector_2": { "avg": { "script": { "source": "return _score.floatValue() - 1;" } } } } }
...
But like, that sucks. Real bad. Remember, I've got 300-dimension vectors, so that's 300 searches in the multi-search, each carrying its own 300-element query vector.
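For what it's worth, the math behind the trick is sound: the dot product with the standard basis vector e_i is exactly component i, which is why averaging the (score-adjusted) dotProduct per basis vector does recover the mean. A plain-Python illustration of why it works:

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def basis(i, dims):
    """Standard basis vector e_i: 1.0 at index i, 0.0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(dims)]

# Dotting with e_i extracts component i, so the avg of the per-document
# dotProduct scores for e_i is the avg of component i across documents.
vec = [1.0, 2.67, 2.33]
components = [dot(vec, basis(i, 3)) for i in range(3)]
# components == vec
```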
Store the vectors separately
This is the obvious choice, but it has its own challenges. In my real use-case, I'm dealing with more than one dense vector field, so I'd have to duplicate the work for each of them.
I also can't use copy_to because of how it works with arrays, so I either have to start using ingest pipelines, or do it as part of my procedural code.
And even once I have a copy, doc values don't preserve array order, so .get(0) isn't guaranteed to be the first element, and that doesn't work either.
What I have to do, it seems, is store the vectors as objects, adding a dynamic template to the above mapping like:
PUT /emails
{
  "mappings": {
    "dynamic": "strict",
    "dynamic_templates": [
      {
        "raw_vector": {
          "path_match": "text_vector_obj.*",
          "mapping": {
            "type": "float"
          }
        }
      }
    ],
    "properties": {
      "text": {
        "type": "text"
      },
      "text_vector": {
        "type": "dense_vector",
        "dims": 300
      },
      "is_spam": {
        "type": "boolean"
      },
      "text_vector_obj": {
        "type": "object",
        "dynamic": true
      }
    }
  }
}
I'm then able to index each document with the vector stored both as an array and as a numbered object:
3: { ..., "text_vector": [ 0, 5, 1 ], "text_vector_obj": { "0": 0, "1": 5, "2": 1 }, ... }
This works, and indeed makes using the avg aggregation with a simple field argument quite trivial, but it's tremendously clunky at ingestion, not to mention that it stores the same values in multiple fields.
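The clunky part of the ingestion is at least mechanical. A sketch of the client-side transform, assuming documents pass through my own code before indexing (with_vector_obj is just a hypothetical helper name, and the field names match the mapping above):

```python
def with_vector_obj(doc, field="text_vector", obj_field="text_vector_obj"):
    """Return a copy of the doc with its vector duplicated as a numbered object."""
    out = dict(doc)
    out[obj_field] = {str(i): value for i, value in enumerate(doc[field])}
    return out

# Example document 3 from above:
doc = {"text": "...", "text_vector": [0, 5, 1], "is_spam": True}
indexed = with_vector_obj(doc)
# indexed["text_vector_obj"] == {"0": 0, "1": 5, "2": 1}
```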
Alternatives?
So, are there any alternatives out there? I recognize that I've covered a couple of dead ends here that could become feature requests over on GitHub, but I wanted to sanity-check my options before going to the devs. I wonder if there's a cleaner way to be doing this.