Sparse vector vs rank features. Which one?

I am trying to implement a custom search function that ranks documents based on sparse features. Concretely, each document has a list of sparse features, e.g.

doc_1 -> {a: 1.1, b: 0.2}
doc_2 -> {a: 0.2, z: 0.3, zz: 1.2} ...

Now at query time, the query has a list of the sparse dimensions that appear in it, e.g. [a, zz, b]. My scoring function for each doc simply sums the values of the terms that appear in both the query and the document. Taking the example above:
doc_1_score = 1.1 + 0.2
doc_2_score = 0.2 + 1.2
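The scoring described above can be sketched in a few lines of Python (names are illustrative):

```python
def score(doc_features, query_terms):
    """Sum the values of features that appear in both the doc and the query."""
    return sum(doc_features.get(term, 0.0) for term in query_terms)

docs = {
    "doc_1": {"a": 1.1, "b": 0.2},
    "doc_2": {"a": 0.2, "z": 0.3, "zz": 1.2},
}
query = ["a", "zz", "b"]

for name, feats in docs.items():
    print(name, score(feats, query))
```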

My question is: what is the most efficient way to implement this? I have two ideas so far:

  1. use rank_features to store all the values in the documents, then use a list of "should" clauses to compute the score
  2. store the features as sparse vectors, then use a dot_product to compute the final score.
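For context, option 1 can be sketched as Python dicts mirroring the Elasticsearch request bodies. This is a sketch only: the `rank_features` field type and the `rank_feature` query with a `linear` function (which scores each clause by the stored value, so summing the should-clauses reproduces the dot-product-style score) do exist, but field names here are illustrative and the exact DSL should be checked against the docs for your version:

```python
# Assumed mapping: one rank_features field holding the sparse values.
mapping = {"mappings": {"properties": {"features": {"type": "rank_features"}}}}

def build_query(query_terms):
    """Build a bool query with one rank_feature should-clause per query term."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"rank_feature": {"field": f"features.{t}", "linear": {}}}
                    for t in query_terms
                ]
            }
        }
    }

print(build_query(["a", "zz", "b"]))
```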

Which one do you think will be more efficient (memory & speed)? Is there a better way to accomplish this, e.g. using an inverted index? Thank you!

Can anyone help?

Hello there,
we have deprecated the sparse_vector datatype and you should not use it anymore; we did not see good adoption for it.

Using rank_features seems to be a good alternative, provided you don't have so many features in your query that the resulting boolean query becomes unreasonably large.

Thanks for the reply! What is considered reasonable for the number of boolean query features? Usually I will have 20-100 features in the query. Is that okay? Thanks!

20-100 features sounds reasonable. There is a limit on the maximum number of clauses within a boolean query that should not be exceeded.
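A query builder can guard against that limit up front. The historical default of the `indices.query.bool.max_clause_count` setting was 1024; newer Elasticsearch versions size the limit dynamically, so treat the number below as an assumption to verify for your version:

```python
ASSUMED_MAX_CLAUSES = 1024  # historical default; verify for your ES version

def check_clause_count(query_terms, limit=ASSUMED_MAX_CLAUSES):
    """Raise before sending a bool query that would exceed the clause limit."""
    if len(query_terms) > limit:
        raise ValueError(
            f"{len(query_terms)} should-clauses exceed the limit of {limit}"
        )
    return len(query_terms)
```

With 20-100 features per query, this check always passes under the assumed default.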

Cool! Thanks.

(Newbie here) do you think the type of boolean query I am interested in can scale and maintain high speed on a large index? I have 10-50 million documents in the index.

10-50 million docs is not a very big collection, but the performance of a query depends on many factors: how many docs contain a particular feature for a single clause, how many clauses you have in total, etc. A good thing about the rank_feature query is that it can efficiently skip non-competitive documents if you only need the top N docs and don't need the total hit count.

So the best advice for you is to index your collection and test the performance of your queries yourself.
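The skipping idea mentioned above can be illustrated with a toy top-N sketch: if an upper bound on a document's possible score (the sum of per-feature maximum values) cannot beat the current N-th best score, the document is skipped without full scoring. This is only a simplified model of the WAND-style optimization, not Lucene's actual implementation:

```python
import heapq

def top_n(docs, query_terms, n, max_per_feature):
    """docs: {name: {feature: value}}; max_per_feature: per-feature upper bounds."""
    heap = []  # min-heap of (score, name), capped at size n
    for name, feats in docs.items():
        # Upper bound on this doc's score, from the features it contains.
        bound = sum(max_per_feature.get(t, 0.0) for t in query_terms if t in feats)
        if len(heap) == n and bound <= heap[0][0]:
            continue  # cannot be competitive: skip full scoring
        s = sum(feats.get(t, 0.0) for t in query_terms)
        if len(heap) < n:
            heapq.heappush(heap, (s, name))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, name))
    return sorted(heap, reverse=True)
```

Note that tracking the total hit count would force every document to be scored, which is why the optimization only applies when the total count is not needed.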

Thank you! I tested it myself and the speed is not bad. Is there a place I can read more about how rank features are implemented? E.g. what kind of data structure is used, and how it skips non-competitive docs, etc.

About rank features, you can read more here

We also have a blog devoted to the topic of skipping non-competitive hits.

Further details on how the rank features field is implemented can be found in the Lucene code here


Hi, I have implemented rank_feature as you suggested and it works great in general. I am using the log-based rank_feature with scaling factor = 1, i.e. y = log(x + 1).
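Under that configuration, the expected score for a document is the sum of log(1 + value) over the matching features, which can be reproduced manually. A minimal sketch, assuming the query boost is left at its default:

```python
import math

def expected_score(doc_features, query_terms, scaling_factor=1.0):
    """Reproduce the log-based rank_feature score: sum of log(c + x)."""
    return sum(
        math.log(scaling_factor + doc_features[t])
        for t in query_terms
        if t in doc_features
    )
```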

One thing I notice is that the returned score sometimes differs from the correct score when I compute it manually. In general the overall ranking is correct; however, there are cases where one doc should have a slightly higher score but actually gets a lower one. For example, the returned score and the actual score may look like:
(_score, my_score)
8.0079155 7.726183013850095
7.4573565 7.240700372005419
5.7250347 5.718247028450972
5.582849 5.3363395329703245
5.35503 5.120882132596499
5.3481665 5.350607821185878
5.231105 5.068864868915258
5.190113 5.007839149255898
4.9003787 4.8552734990413615

Do you have any ideas why this happens? Is it because of the 9-bit precision, or because the skipping algorithm does some approximation?

"In general the overall ranking is correct"

Have you experienced a case where the ranking was incorrect? It looks from the results you submitted that the ranking is incorrect. We would appreciate it if you could share a reproducible test case that results in wrong ranking.

"Do you have any ideas why this happens? Is it because of the 9-bit precision, or because the skipping algorithm does some approximation?"

The skipping algorithm should not be to blame here. The approximation happens because of the 9-bit precision and converting the Math.log result to a float score.
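That kind of precision loss can be simulated by truncating the low mantissa bits of a float32. This is only an illustration of why a stored value round-trips inexactly, not the exact encoding Lucene's FeatureField uses; the choice of 8 kept mantissa bits (plus the implicit leading bit, roughly the "9 bits" mentioned above) is an assumption:

```python
import struct

def truncate_float(x, kept_mantissa_bits=8):
    """Keep only the top bits of an IEEE-754 float32 mantissa (illustrative)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    drop = 23 - kept_mantissa_bits          # float32 has a 23-bit mantissa
    bits &= ~((1 << drop) - 1)              # zero out the dropped low bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(truncate_float(1.1))  # close to 1.1, but not exact
```

Summing several such truncated values explains small score differences like the ones in the table above, including occasional rank swaps between near-equal scores.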

Yes, there are cases where the ranking is incorrect, as shown in the example above. My index is pretty big; what's the best way for me to share a reproducible test?

The best way would be to create an issue in
rank_feature query scores don't depend on other documents, so sharing just the few documents that produce the wrong ranking would be very helpful. Thank you

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.