I'm trying to index arrays of floating point feature values and run more_like_this queries to find documents whose features are similar. Each document has roughly 300 floating point dimensions, and the order of the dimensions matters.
To find similar features, my mapping stores the features in a text field analyzed by a custom analyzer: a whitespace tokenizer followed by the filters ["delimited_payload_filter", "min_hash_filter"]. My hope is that this can take input such as "0.18 -0.01 0.37 -0.24 -0.03 0.21 0.20 0.19 0.06 ...", tokenized so that each element carries its position as a payload via the delimited_payload_filter (e.g. "0.18|1 -0.01|2 ..."), and then hash those tokens with min_hash (using some number of buckets/hashes) to produce a meaningful locality-sensitive hash over the feature values. I'm able to run the more_like_this query on this field just fine, but I'm not entirely sure what is actually happening.
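For reference, the setup looks roughly like this (index, analyzer, and field names are placeholders, and the min_hash parameter values are just examples, not my exact settings):

```json
PUT /features_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "feature_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["delimited_payload_filter", "min_hash_filter"]
        }
      },
      "filter": {
        "min_hash_filter": {
          "type": "min_hash",
          "hash_count": 1,
          "bucket_count": 512,
          "hash_set_size": 1,
          "with_rotation": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "features": {
        "type": "text",
        "analyzer": "feature_analyzer",
        "term_vector": "with_positions_payloads"
      }
    }
  }
}
```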
When I use explain=true, I get a lot of information about the min_hash values that matched, but what I'm not positive about is whether the positions of the feature values are really being used as part of the hashed term, or whether only the token text itself is. In other words, is min_hash hashing "0.18|1" as a token, or just "0.18" (with the position stored separately as a payload)?
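One way I've considered checking this is the _analyze API, which shows the tokens the analyzer actually emits after each filter runs (again assuming the placeholder names above):

```json
GET /features_index/_analyze
{
  "analyzer": "feature_analyzer",
  "text": "0.18|1 -0.01|2 0.37|3"
}
```

But the output is a set of min_hash tokens, and I can't tell from those whether the "|1" payload part influenced the hash or was stripped off before hashing.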
Any help much appreciated!