Getting aggregate statistics about a dense_vector field

Hey,

I'm making lots of great use of the dense_vector field type, but I've hit a bit of a wall now. Is there any way I can get the "average" from a vector field?

Context

My real use-case is a bit more complex, but imagine I have this mapping for emails with a spam flag:

PUT /emails
{
    "mappings": {
        "properties": {
            "text": {
                "type": "text"
            },
            "text_vector": {
                "type": "dense_vector",
                "dims": 300
            },
            "is_spam": {
                "type": "boolean"
            }
        }
    }
}

I've vectorized and ingested my data, so I now have 150,000 docs that each have 300-dimension vectors associated with them.

Now, I want to do a vector similarity search (using cosineSimilarity) to find documents similar to those which I've flagged as spam, but which are not yet flagged.

With a query vector in-hand, this is made easy by the vector function scoring functionality, and I can quickly retrieve the most similar documents.

POST /emails/_search
{
    "query": {
        "script_score": {
            "query" : {
                "bool" : {
                    "must_not" : {
                        "term" : {
                            "is_spam" : true
                        }
                    }
                }
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0", 
                "params": {
                    "query_vector": [...] // this is lengthy
                }
            }
        }
    }
}

But I need a way to build out that query_vector initially. This works well if I have seed text, or a seed document, because I can calculate the vector (or fetch it from the source) as a prerequisite, but when I want to cover a collection of documents, it gets more difficult.

As an initial stab, I want to try and get the "center" of the flagged documents. To find this, I simply need the item-wise mean vector.

So if I have these documents:

1: { ..., "text_vector": [ 1, 2, 3 ], ... },
2: { ..., "text_vector": [ 2, 1, 3 ], ... },
3: { ..., "text_vector": [ 0, 5, 1 ], ... }

Then I can just sum up each array index, then divide by three:

sum[0] = 1 + 2 + 0 = 3
sum[1] = 2 + 1 + 5 = 8
sum[2] = 3 + 3 + 1 = 7

avg[0] = 3 / 3 = 1
avg[1] = 8 / 3 = 2.67
avg[2] = 7 / 3 = 2.33

And so, I pass in "query_vector": [ 1, 2.67, 2.33 ] to the above query, and retrieve the documents I'm looking for.
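
For reference, here's a minimal sketch of that calculation in Python. The helper name `mean_vector` is just something I made up for illustration; it isn't part of any Elasticsearch client.

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the element-wise mean of a set of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dims = len(vectors[0])
    if any(len(v) != dims for v in vectors):
        raise ValueError("all vectors must have the same length")
    # zip(*vectors) groups the values found at each index across all documents
    return [sum(column) / len(vectors) for column in zip(*vectors)]


docs = [[1, 2, 3], [2, 1, 3], [0, 5, 1]]
print([round(x, 2) for x in mean_vector(docs)])  # [1.0, 2.67, 2.33]
```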

Current approach

Right now, I start with a paginated or scroll query to get back the document sources, then build up the sums and averages procedurally. This is fine for a few documents, but it becomes troublesome as the number of documents grows. At the very least, it takes a lot of network traffic to read out 300 dimensions from every document.
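
Roughly, the client-side accumulation looks like the sketch below. It assumes each hit has the usual `{"_source": {...}}` shape from a search response; the function name `mean_from_hits` is hypothetical, and how the hits are fetched (pagination or scroll) is left out.

```python
from typing import Dict, Iterable, List, Optional


def mean_from_hits(hits: Iterable[Dict], field: str = "text_vector") -> Optional[List[float]]:
    """Accumulate per-index sums over search hits and return the mean vector."""
    sums: List[float] = []
    count = 0
    for hit in hits:
        vector = hit["_source"][field]
        if not sums:
            # Size the accumulator from the first vector seen
            sums = [0.0] * len(vector)
        for i, value in enumerate(vector):
            sums[i] += value
        count += 1
    if count == 0:
        return None  # nothing matched, so there is no centroid
    return [total / count for total in sums]


sample_hits = [
    {"_source": {"text_vector": [1, 2, 3]}},
    {"_source": {"text_vector": [2, 1, 3]}},
    {"_source": {"text_vector": [0, 5, 1]}},
]
print([round(x, 2) for x in mean_from_hits(sample_hits)])  # [1.0, 2.67, 2.33]
```

Because it only keeps running sums, memory stays flat no matter how many hits stream through, but every vector still has to cross the network.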

Desired approach

I'd love to be able to use the avg aggregation to get these values handed to me. Obviously, a single aggregation isn't going to give me back an array result, but I would much prefer something like:

POST /emails/_search
{
    "size": 0,
    
    "query": {
        "term" : {
            "is_spam" : true
        }
    },

    "aggs": {
        "text_vector_0": {
            "avg": {
                "field": "text_vector.0"
            }
        },
        "text_vector_1": {
            "avg": {
                "field": "text_vector.1"
            }
        },
        ...
        "text_vector_299": {
            "avg": {
                "field": "text_vector.299"
            }
        }
    }
}

But of course, as expected, those fields don't exist.

Options

Read the values in a script

I had thought to use script aggregations, but any attempt to read elements from the dense_vector field returns the error:

unsupported_operation_exception: accessing a vector field's value through 'get' or 'value' is not supported

Use the dot product to retrieve values

I came up with another (I think clever) idea: use the dotProduct function with rows of the identity matrix, so that each one-hot query vector picks out a single dimension. But I quickly found that dotProduct, cosineSimilarity, and the other vector functions are only available in the score context.

So now the best I've got is a multi-search that aggregates score for the query vectors individually:

POST /_msearch?filter_path=responses.aggregations.text_vector_*.value
{ "index": "emails" }
{ "size": 0, "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "dotProduct(params.query_vec, 'text_vector') + 1.0", "params": { "query_vec": [1, 0, 0] } } } }, "aggs": { "text_vector_0": { "avg": { "script": { "source": "return _score.floatValue() - 1;" } } } } }
{ "index": "emails" }
{ "size": 0, "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "dotProduct(params.query_vec, 'text_vector') + 1.0", "params": { "query_vec": [0, 1, 0] } } } }, "aggs": { "text_vector_1": { "avg": { "script": { "source": "return _score.floatValue() - 1;" } } } } }
{ "index": "emails" }
{ "size": 0, "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "dotProduct(params.query_vec, 'text_vector') + 1.0", "params": { "query_vec": [0, 0, 1] } } } }, "aggs": { "text_vector_2": { "avg": { "script": { "source": "return _score.floatValue() - 1;" } } } } }
...

But like, that sucks. Real bad. Remember, I've got 300-dimension vectors, which means 300 searches in the request, each carrying a 300-element query vector.
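
To give a sense of the size, here's a sketch of how that multi-search body could be generated rather than written by hand. It only builds the NDJSON string and mirrors the queries above; `build_msearch_body` is a hypothetical helper, and actually sending the request is left out.

```python
import json
from typing import List


def build_msearch_body(index: str, field: str, dims: int) -> str:
    """Build an NDJSON _msearch body with one one-hot dotProduct search per dimension."""
    lines: List[str] = []
    for i in range(dims):
        # One-hot vector: 1.0 at position i, 0.0 everywhere else
        one_hot = [1.0 if j == i else 0.0 for j in range(dims)]
        header = {"index": index}
        body = {
            "size": 0,
            "query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        "source": f"dotProduct(params.query_vec, '{field}') + 1.0",
                        "params": {"query_vec": one_hot},
                    },
                }
            },
            "aggs": {
                f"{field}_{i}": {
                    "avg": {"script": {"source": "return _score.floatValue() - 1;"}}
                }
            },
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # The _msearch body is newline-delimited and must end with a newline
    return "\n".join(lines) + "\n"


body = build_msearch_body("emails", "text_vector", 3)
print(len(body.splitlines()))  # 6 lines: a header and a search per dimension
```

With `dims=300` that's 600 lines and 90,000 vector elements in a single request.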

Store the vectors separately

This is the obvious choice, but it has its own challenges. In my real use-case, I'm dealing with more than one dense vector, so I'd have to duplicate the work for each of them.

I also can't use copy_to because of how it works with arrays, so I either have to start using ingestion pipelines, or do it as part of my procedural code.

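And even once I have a copy as a plain numeric array, numeric doc values are stored sorted rather than in the original array order, so .get(0) isn't guaranteed to return the first element. That doesn't work either.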

What I have to do, it seems, is store the vectors as objects, adding a dynamic template to the above mapping like:

PUT /emails
{
    "mappings": {
        "dynamic": "strict",
        "dynamic_templates": [
            {
                "raw_vector": {
                    "path_match": "text_vector_obj.*",
                    "mapping": {
                        "type": "float"
                    }
                }
            }
        ],
        "properties": {
            "text": {
                "type": "text"
            },
            "text_vector": {
                "type": "dense_vector",
                "dims": 300
            },
            "is_spam": {
                "type": "boolean"
            },
            "text_vector_obj": {
                "type": "object",
                
                "dynamic": true
            }
        }
    }
}

I'm then able to index documents that carry the vector both as an array and as an object keyed by index:

3: { ..., "text_vector": [ 0, 5, 1 ], "text_vector_obj": { "0": 0, "1": 5, "2": 1 }, ... }
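
On the ingestion side, the conversion itself is small when done in procedural code. A minimal sketch, where `vector_to_object` is a made-up helper name:

```python
from typing import Dict, Sequence


def vector_to_object(vector: Sequence[float]) -> Dict[str, float]:
    """Turn a vector into an object keyed by the stringified index."""
    return {str(i): float(v) for i, v in enumerate(vector)}


doc = {"text_vector": [0, 5, 1]}
# Store the same values a second time, in object form, for the aggregations
doc["text_vector_obj"] = vector_to_object(doc["text_vector"])
print(doc["text_vector_obj"])  # {'0': 0.0, '1': 5.0, '2': 1.0}
```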

This works, and it makes using the avg aggregation with a simple field argument trivial, but it's tremendously clunky at ingestion time, not to mention that every value has to be stored in two fields.
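
With the object field in place, the aggregation request and the step that turns the response back into a vector can both be generated. This is only a sketch: the helper names are hypothetical, and the fake response below assumes the usual avg aggregation shape of `{"value": ...}` per named aggregation.

```python
from typing import Dict, List


def build_avg_aggs(obj_field: str, dims: int) -> Dict[str, Dict]:
    """Build one avg aggregation per dimension of the object-style vector field."""
    return {
        f"{obj_field}_{i}": {"avg": {"field": f"{obj_field}.{i}"}}
        for i in range(dims)
    }


def collect_means(aggregations: Dict[str, Dict], obj_field: str, dims: int) -> List[float]:
    """Read the avg results back out in index order to rebuild the mean vector."""
    return [aggregations[f"{obj_field}_{i}"]["value"] for i in range(dims)]


aggs = build_avg_aggs("text_vector_obj", 3)
print(aggs["text_vector_obj_1"])  # {'avg': {'field': 'text_vector_obj.1'}}

# Hand-written stand-in for the "aggregations" part of a search response
fake_response = {
    "text_vector_obj_0": {"value": 1.0},
    "text_vector_obj_1": {"value": 2.67},
    "text_vector_obj_2": {"value": 2.33},
}
print(collect_means(fake_response, "text_vector_obj", 3))  # [1.0, 2.67, 2.33]
```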

Alternatives?

So, are there any alternatives out there? I recognize that I've covered a couple of dead-ends here that could become feature requests over on GitHub, but I wanted to sanity check my options before going to the devs. I wonder if there's a cleaner way to do this.

Worth noting, the cluster I'm on happens to be 7.8, but it's not a huge deal if I have to update it.
