I am not an expert in Elasticsearch queries, and I have not found a solution for filtering my results. My index contains 450,000 documents. The issue is that every search returns all 450,000 documents, sorted by relevance, and when I examine the last results, some of them do not even match my query. I am therefore considering limiting the results to documents with a similarity score greater than 0.5, so that only relevant results are returned.
Mapping of the field:
"vector": {
  "type": "dense_vector",
  "dims": 768,
  "index": false
},
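For context, with `index: false` the only way to score documents against this field is an exact (brute-force) `script_score` query, so the search is presumably something like the sketch below (the index name `my-index` is an assumption, and the 768-element query vector is elided; the `+ 1.0` offset is the usual way to keep cosine scores non-negative):

```json
POST /my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
        "params": { "query_vector": [0.12, -0.34] }
      }
    }
  }
}
```

A query of this shape always scores every document, which is why all 450,000 come back by default.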
Something that may help in deciding which score threshold to pick:
If your documents have category fields (e.g. department: kitchen utensils), you can use them to see which score ranges produce a bewildering number of distinct values (indicating a random rather than a cohesive set of concepts).
In this visualisation the query score is on the x-axis, and each vertical bar breaks down the document categories matching that score band. The high-scoring documents come from a small selection of related categories, while the lower-scoring documents come from a huge number of unrelated categories. The number of categories in a band is, in effect, a measure of how many different meanings the matches in that range have, and it gives a reasonable indication of which score bands contain meaningless results.
elastiknn_dense_float_vector isn't officially supported by Elasticsearch, and I'm not even sure it's currently maintained.
If you want approximate nearest-neighbour search, you should try dense_vector with index: true. That lets you set an expected similarity threshold at query time.
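A sketch of what that could look like (the index name `my-index`, the `k`/`num_candidates` values, and the 0.5 cut-off are assumptions; the `knn.similarity` threshold parameter requires a recent 8.x release). First the mapping, re-indexed with `index: true`:

```json
PUT /my-index
{
  "mappings": {
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
```

Then a knn search that drops any hit below the similarity floor (query vector elided):

```json
POST /my-index/_search
{
  "knn": {
    "field": "vector",
    "query_vector": [0.12, -0.34],
    "k": 50,
    "num_candidates": 200,
    "similarity": 0.5
  }
}
```

Note that `similarity` here is expressed in terms of the raw metric (cosine, in this mapping), not the transformed `_score`, so a 0.5 cosine cut-off can be stated directly.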
Or you can continue to use a script_score query and set min_score on that specific search request.
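A minimal sketch of the min_score variant, assuming the script scores documents as `cosineSimilarity + 1.0` (so a cosine cut-off of 0.5 corresponds to `min_score: 1.5`; index name and query vector are placeholders):

```json
POST /my-index/_search
{
  "min_score": 1.5,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
        "params": { "query_vector": [0.12, -0.34] }
      }
    }
  }
}
```

This still scores every document (it is an exact search), but hits below `min_score` are excluded from the response.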