I am trying to implement a customized search function that will rank documents based on sparse features. Concretely, each document will have a list of sparse features, e.g.
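(values made up for illustration, chosen to be consistent with the scores below):

```
doc_1: { "a": 0.2, "zz": 1.2, "c": 0.4 }
doc_2: { "a": 1.1, "b": 0.2, "d": 3.0 }
```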
Now at the query stage, the query will have a list of the sparse dimensions that appear, e.g. [a, zz, b]. My scoring function for each doc simply adds up the values of all terms that appear in both the query and the document. Taking the above example:
doc_1_score = 0.2 + 1.2
doc_2_score = 1.1 + 0.2
My question is: what is the most efficient way to implement this? I have two ideas now:
1. Use rank features to save all the values in documents, and then use a list of "should" clauses to get the score (see the sketch below).
2. Save the features as sparse vectors, then use a dot_product to get the final score.
Which one do you think will be more efficient (memory & speed)? Is there a better way to accomplish this, e.g. using an inverted index? Thank you!
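For reference, here is a minimal sketch of what I mean by idea 1, assuming the rank_features field type and the rank_feature query (index and field names are placeholders; note that by default each rank_feature clause scores with a saturation function of the stored value, not the raw value):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "features": { "type": "rank_features" }
    }
  }
}

PUT my-index/_doc/1
{ "features": { "a": 0.2, "zz": 1.2, "c": 0.4 } }

GET my-index/_search
{
  "query": {
    "bool": {
      "should": [
        { "rank_feature": { "field": "features.a" } },
        { "rank_feature": { "field": "features.zz" } },
        { "rank_feature": { "field": "features.b" } }
      ]
    }
  }
}
```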
Thanks for the reply! What is considered a reasonable number of rank_feature clauses in a boolean query? Usually I will have 20-100 features in the query. Is that considered okay? Thanks!
(Newbie here) Do you think this type of boolean query that I am interested in can scale and maintain high speed for a large index? I have 10-50 million documents in the index.
10-50 million docs is not a very big collection, but the performance of a query depends on many factors: how many docs contain a particular feature for a single clause, how many clauses you have in total, etc. A good thing about the rank feature query is that it can efficiently skip non-competitive documents if you just need the top N docs and don't need the total hit count.
So the best advice for you is to index your collection and test the performance of your queries yourself.
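For example, a request along these lines (a sketch; the query body is illustrative) lets the top-N skipping optimization kick in by not tracking the total hit count:

```
GET my-index/_search
{
  "track_total_hits": false,
  "size": 10,
  "query": {
    "bool": {
      "should": [
        { "rank_feature": { "field": "features.a" } },
        { "rank_feature": { "field": "features.b" } }
      ]
    }
  }
}
```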
Thank you! I tested it myself and the speed is not bad. Is there a place I can read more about how rank features are implemented? E.g., what kind of data structure is used, how it skips non-competitive docs, etc.
Hi, I have implemented the rank_feature query as you suggested and it works great in general. I am using the log-based rank_feature with scaling_factor = 1, i.e. y = log(x + 1).
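Each clause looks roughly like this (the field name is a placeholder):

```
{
  "rank_feature": {
    "field": "features.a",
    "log": { "scaling_factor": 1 }
  }
}
```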
One thing I notice is that the returned score sometimes differs from the correct score when I compute it manually. In general the overall ranking is correct; however, there are cases where one doc should have a slightly higher score, but it actually gets a lower score. For example, the returned score and actual score rank list may look like:
```
_score       my_score
8.0079155    7.726183013850095
7.4573565    7.240700372005419
5.7250347    5.718247028450972
5.582849     5.3363395329703245
5.35503      5.120882132596499
5.3481665    5.350607821185878
5.231105     5.068864868915258
5.190113     5.007839149255898
4.9003787    4.8552734990413615
```
Do you have any ideas why this happens? Is it because of the 9-bit precision, or because the skipping algorithm does some approximation?
Have you experienced a case where the ranking was incorrect? It looks from the result you submitted that the ranking is incorrect. We would appreciate it if you could share a reproducible test case that results in the wrong ranking.
Do you have any ideas why this happens? Is it because of the 9-bit precision, or because the skipping algorithm does some approximation?
The skipping algorithm should not be to blame here. The approximation happens because of the 9-bit precision and converting the Math.log result to a float score.
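As a rough sketch of the effect (this emulates the ~9-significant-bit encoding by truncating the low mantissa bits of a float; the exact encoding inside Lucene may differ slightly):

```java
public class RankFeaturePrecision {
    // Emulate storing a feature value with ~9 significant bits:
    // drop the low 15 bits of the float32 mantissa (truncation,
    // worst-case relative error on the order of 0.4%).
    static float quantize(float value) {
        int bits = Float.floatToIntBits(value);
        bits &= ~((1 << 15) - 1);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Manually computed score with full double precision...
        double exact = Math.log(1 + 0.2) + Math.log(1 + 1.2);
        // ...vs. the score after quantization and float conversion.
        float approx = (float) Math.log(1 + quantize(0.2f))
                     + (float) Math.log(1 + quantize(1.2f));
        System.out.println("exact=" + exact + " approx=" + approx);
    }
}
```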
Yes, there are cases where the ranking is incorrect, as shown in the example above. My index is pretty big; what's the best way for me to share a reproducible test?
@snakeztc
The best way would be to create an issue in https://github.com/elastic/elasticsearch. rank_feature query scores don't depend on other documents, so showing just these few documents that produce the wrong ranking would be very helpful. Thank you