Hi guys,
I want to run K-Nearest-Neighbors (KNN) search on feature vectors stored in Elasticsearch (ES).
I wrote a plugin based on this plugin: https://github.com/MLnick/elasticsearch-vector-scoring
I'm looking to improve the query performance. Right now a query takes ~1 second, and I would like to make it 10 times faster.
Details:
I have 2M documents in the index.
Each document contains a 64-dimensional float vector in a field named "embedding".
This is its mapping:
"analysis": {
  "analyzer": {
    "payload_analyzer": {
      "filter": "delimited_payload_filter",
      "tokenizer": "whitespace",
      "type": "custom"
    }
  }
}
"mappings": {
  "properties": {
    "embedding": {
      "analyzer": "payload_analyzer",
      "term_vector": "with_positions_offsets_payloads",
      "type": "string"
    }
  }
}
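For illustration, here is a minimal Python sketch of how a vector might be encoded for this field before indexing. It assumes the `index|value` token format used by the linked plugin, with `|` being the default delimiter of `delimited_payload_filter`. The helper name `encode_vector` is hypothetical and not part of any plugin.

```python
def encode_vector(vector):
    """Encode a list of floats as whitespace-separated 'index|value' tokens.

    Assumption: the whitespace tokenizer splits the tokens, and the
    delimited payload filter stores the part after '|' as the term payload.
    """
    return " ".join(f"{i}|{v}" for i, v in enumerate(vector))


# Example document body for a 3-dimensional vector (real vectors have 64 values).
doc = {"embedding": encode_vector([0.12, -0.5, 0.33])}
print(doc["embedding"])  # 0|0.12 1|-0.5 2|0.33
```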
This is the search query:
{
  "size": 100,
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "script_score": {
        "script": "payload_vector_score",
        "lang": "native",
        "params": {
          "field": "embedding",
          "cosine": false,
          "vector": [-0.06555712223052979, 0.0639316588640213, -0.1625019609928131, -0.049717679619789124, -0.08388650417327881, -0.05376458540558815, -0.21441558003425598, 0.14069288969039917, 0.028580941259860992, 0.07442957907915115, -0.19108714163303375, -0.10003119707107544, 0.034126054495573044, -0.11807726323604584, 0.04761182889342308, 0.004601459950208664, -0.12167082726955414, 0.2301076203584671, -0.005734231788665056, 0.016479089856147766, 0.025114329531788826, -0.015090115368366241, 0.005890047177672386, -0.04142259433865547, 0.15503185987472534, 0.09912215173244476, 0.1551043689250946, 0.14985895156860352, 0.2064201831817627, -0.1238853856921196, 0.04467460513114929, -0.061931200325489044, -0.04865756630897522, -0.009241082705557346, -0.19579431414604187, 0.21952545642852783, 0.1435101181268692, -0.2241126447916031, 0.08423150330781937, -0.11718004941940308, 0.01940910331904888, -0.09160779416561127, 0.1686438024044037, 0.1839606910943985, 0.1823773831129074, 0.07107185572385788, 0.1360888034105301, 0.21161314845085144, -0.009615485556423664, 0.08052477240562439, -0.1621086150407791, -0.037252187728881836, -0.0528680719435215, -0.07718119770288467, -0.05522914603352547, -0.24222344160079956, 0.052051275968551636, -0.10451067239046097, 0.09648159146308899, 0.11125080287456512, -0.2878655791282654, -0.10746297240257263, 0.04359650984406471, 0.11088574677705765]
        }
      }
    }
  }
}
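As a reference for what this query is meant to compute, here is a rough Python sketch of the brute-force scoring. It assumes that "cosine": false means the score is the plain dot product between the query vector and the stored vector, and that "cosine": true would normalize it to cosine similarity. The function names and the tiny `docs` dictionary are made up for illustration; this is not the plugin's code.

```python
import heapq
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def score(query, stored, cosine=False):
    """Dot product, or cosine similarity when cosine=True."""
    d = dot(query, stored)
    if not cosine:
        return d
    norm = math.sqrt(dot(query, query)) * math.sqrt(dot(stored, stored))
    return d / norm if norm else 0.0


def top_k(query, docs, k=100, cosine=False):
    """Score every document (linear scan) and keep the k best.

    docs maps a document id to its vector; returns (score, id) pairs,
    best first.
    """
    scored = ((score(query, vec, cosine), doc_id) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scored)


docs = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [3.0, 4.0]}
print(top_k([1.0, 1.0], docs, k=2))  # [(7.0, 'c'), (2.0, 'b')]
```

Since every matching document is scored, a query over 2M documents with 64 dimensions does roughly 128M multiply-adds, plus the cost of reading and decoding each stored vector.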
I tried several other ways to achieve this, but all were slower or failed:
Try 1 - failed
Store the vector as an array of floats.
It failed because the ES index does not keep the order of the elements in the vector, and reading the vector from _source was (of course) very slow.
Try 2 - slower than the above
Store the vector as a comma-separated string.
I wrote a scoring plugin which converts the string to a vector at runtime.
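To make the two attempts concrete, here is a small Python sketch. It assumes that a multi-valued numeric field is read back in sorted order rather than insertion order (my understanding of why Try 1 gave wrong scores), and it uses a hypothetical `parse_vector` helper to stand in for the per-document string parsing in Try 2.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def parse_vector(text):
    """Try 2: parse a comma-separated string; this runs for every scored doc."""
    return [float(t) for t in text.split(",")]


query = [1.0, 2.0, 3.0]
stored = [0.5, -0.25, 0.125]

# Try 1 (assumption): the values come back sorted, so positions no longer
# line up with the query vector and the score is wrong.
as_read = sorted(stored)  # [-0.25, 0.125, 0.5]

print(dot(query, stored))   # 0.375 (correct)
print(dot(query, as_read))  # 1.5 (wrong, order was lost)

# Try 2: correct result, but the string is split and parsed on every call.
print(dot(query, parse_vector("0.5,-0.25,0.125")))  # 0.375
```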
Am I approaching this wrong? Is there any optimization that can be done?
Thanks in advance,
Lior