Best way to index/store array


(YG) #1

Hello

We are trying to store a fixed-size floating-point vector for each entity in ES. For example, one vector can look like [0.2 0.4 0.8 0.1 0.05]. We want the vector to keep the order of its elements, so that we can compare vectors in native scripts. We have identified the following approaches, some of which might not work, and we would like opinions on which one is best.

  1. Directly store the vector as an array of doubles. In native scripts, the values of the array come back sorted and deduped if we use doc() to retrieve them, so we would have to retrieve them with source() instead. However, this approach is very slow, because source() loads the data from disk.

  2. Store them as key:payload pairs. The values are stored as 1:0.2 2:0.4 3:0.8 ..., where the positions 1 2 3 ... are the keys used to index the values, and the vector values are stored as payloads. In native scripts, we can load the values by accessing the payloads. This is in memory, but each time we load a vector, we need to decode a base64 string for each float value.

  3. Use nested objects, storing each key and value in a nested object. Based on some discussions online, this should keep the order of the key/value pairs. In practice, however, we found that the order of keys and values is still not kept. Is there a way to preserve the order of the key/value pairs?

  4. Encode the whole vector as a single base64 string, and decode it in native scripts on the fly. This approach adds a decoding cost, but base64 decoding seems relatively cheap. Compared to (2), which also needs base64 decoding, which way is better?

  5. Hard-code the ES data schema as field1 field2 ... fieldN, and store one vector element in each field. This would be efficient and should definitely work. However, it is a very bad design, and it cannot accommodate vectors of varying length.
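To make option (4) concrete, here is a minimal sketch of the encoding side, written in Python for brevity. The function names encode_vector and decode_vector are hypothetical, and the choice of big-endian 8-byte doubles is an assumption (it matches the default byte order of Java's ByteBuffer, which a native script could use to decode). It only shows that one base64 string per vector round-trips the values in their original order; it says nothing about how ES should index the resulting string.

```python
import base64
import struct


def encode_vector(values):
    """Pack the floats as big-endian 8-byte doubles, then base64-encode the bytes.

    Hypothetical helper: the layout (big-endian doubles) is an assumption,
    not something ES prescribes.
    """
    raw = struct.pack(f">{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_vector(encoded):
    """Reverse encode_vector: base64-decode, then unpack 8 bytes per element."""
    raw = base64.b64decode(encoded)
    count = len(raw) // 8  # one double per 8 bytes
    return list(struct.unpack(f">{count}d", raw))


vector = [0.2, 0.4, 0.8, 0.1, 0.05]
encoded = encode_vector(vector)   # a single string for the whole vector
decoded = decode_vector(encoded)  # same values, same positions
```

With this layout the whole vector is decoded in one pass, instead of one base64 decode per element as in option (2).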

Could you please give us some ideas and comments on the best way to store a vector and keep it in order, so that we can access it easily and quickly, in memory, from native scripts?

Thanks!


(Jörg Prante) #2

I think the best method for storing a vector will be using doc values: https://www.elastic.co/blog/disk-based-field-data-a-k-a-doc-values

Multivalued numeric doc value fields are not processed as a sorted set: https://github.com/elastic/elasticsearch/issues/3993

I have never tried it, but it should work IMHO.


(YG) #3

Thanks! Does this mean the values will be stored on disk? We are using them for reranking, so we strongly prefer to keep all data in memory. Also, I tried doc values and found that they no longer dedup the values, but they do still sort them. Maybe I am not doing it correctly.
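To illustrate why sorting alone is still a problem for us, here is a small Python sketch. The example vector, the weights, and the dot helper are made up for illustration. It mimics the two behaviours described above (sorted and deduped, versus sorted only) and shows that a position-dependent comparison gives a different result once the values are sorted.

```python
def dot(a, b):
    """Position-wise product sum; only meaningful if element order is preserved."""
    return sum(x * y for x, y in zip(a, b))


vector = [0.4, 0.2, 0.4, 0.1]      # original order, with a repeated value
weights = [1.0, 0.0, 0.0, 0.0]     # hypothetical weights: only position 0 counts

sorted_deduped = sorted(set(vector))  # sorted, duplicates dropped
sorted_only = sorted(vector)          # sorted, duplicates kept

score_original = dot(vector, weights)     # uses the real first element
score_sorted = dot(sorted_only, weights)  # uses the smallest element instead
```

Keeping duplicates restores the vector length, but the scores still differ because each value has lost its original position.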


(YG) #4

Any other comments/ideas?

