Sparse vector vs rank features. Which one?

I am trying to implement a custom search function that ranks documents based on sparse features. Concretely, each document will have a set of sparse features with values, e.g.

doc_1 -> {a: 1.1, b: 0.2}
doc_2 -> {a: 0.2, z: 0.3, zz: 1.2} ...

Now at the query stage, the query will contain a list of the sparse dimensions that appear, e.g. [a, zz, b]. My scoring function for each doc simply adds up the values of all terms that appear in both the query and the document. For the example above:
doc_1_score = 1.1 + 0.2
doc_2_score = 0.2 + 1.2
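
To make the scoring rule concrete, here is a minimal Python sketch of it, outside of Elasticsearch. The function name overlap_score and the dictionary layout are only for illustration; they are not part of any Elasticsearch API.

```python
def overlap_score(doc_features, query_terms):
    """Sum the document's values for every feature that also appears in the query."""
    wanted = set(query_terms)
    return sum(value for name, value in doc_features.items() if name in wanted)


# The two example documents and the example query from the post.
docs = {
    "doc_1": {"a": 1.1, "b": 0.2},
    "doc_2": {"a": 0.2, "z": 0.3, "zz": 1.2},
}
query = ["a", "zz", "b"]

for doc_id, features in docs.items():
    # doc_1 matches a and b; doc_2 matches a and zz (z is not in the query).
    print(doc_id, overlap_score(features, query))
```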

My question is: what is the most efficient way to implement this? I have two ideas so far.

  1. Use rank features to store all the values in the documents, then use a list of "should" clauses to get the score.
  2. Store the features as sparse vectors, then use a dot_product to get the final score.

Which one do you think will be more efficient (memory and speed)? Is there a better way to accomplish this, e.g. using an inverted index? Thank you!

Can anyone help?

Hello there,
We have deprecated the sparse_vector datatype, and you should not use it anymore. We did not see good adoption for it.

Using rank_features seems to be a good alternative, as long as the number of features in your query is small enough to keep the boolean query reasonable.
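
As a rough sketch of what this could look like, the Python below builds the mapping and the boolean query as plain dictionaries (request bodies only, no client calls). The field name "features" is a made-up example, and the sketch assumes an Elasticsearch version whose rank_feature query supports the "linear" function, which scores each clause with the stored feature value so that the "should" clauses add up to the sum described in the question. Note that rank_features only accepts positive values.

```python
import json

FIELD = "features"  # hypothetical field name, chosen for this example


def build_mapping(field=FIELD):
    """Index mapping with a single rank_features field."""
    return {"mappings": {"properties": {field: {"type": "rank_features"}}}}


def build_query(query_terms, field=FIELD):
    """One rank_feature clause per query term, combined with bool/should."""
    should = [
        {"rank_feature": {"field": f"{field}.{term}", "linear": {}}}
        for term in query_terms
    ]
    return {"query": {"bool": {"should": should}}}


# A document would be indexed as, e.g., {"features": {"a": 1.1, "b": 0.2}}.
print(json.dumps(build_mapping(), indent=2))
print(json.dumps(build_query(["a", "zz", "b"]), indent=2))
```

A document that lacks a given feature simply does not match that clause, so it only collects score from the features it shares with the query.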

Thanks for the reply! What is considered a reasonable number of features for a boolean query? I will usually have 20-100 features in the query. Is that okay? Thanks!

20-100 features sounds reasonable. There is a limit on the maximum number of clauses within a boolean query that should not be exceeded.

Cool! Thanks.

(Newbie here) Do you think the type of boolean query I am interested in can scale and stay fast on a large index? I have 10-50 million documents or more in the index.

10-50 million docs is not a very big collection, but the performance of a query depends on many factors: for a single clause, how many docs contain a particular feature; how many clauses you have in total; and so on. A good thing about the rank feature query is that it can efficiently skip non-competitive documents if you only need the top N docs and don't need the total hit count.
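
For illustration, a search request that asks only for the top N hits and does not need the total hit count could be built like this. It is a sketch of the request body only: the helper name top_n_request and the field name "features" are made up for the example, while "size" and "track_total_hits" are standard search request parameters.

```python
def top_n_request(query, n=10):
    """Wrap a query so that only the top n hits are requested.

    Setting track_total_hits to False tells Elasticsearch that an exact
    total hit count is not needed, which leaves it free to skip
    non-competitive documents.
    """
    return {"size": n, "track_total_hits": False, "query": query}


# Hypothetical single-clause query on a rank_features field named "features".
example_query = {"rank_feature": {"field": "features.a", "linear": {}}}
request_body = top_n_request(example_query, n=10)
print(request_body)
```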

So the best advice for you is to index your collection and test the performance of your queries yourself.

Thank you! I tested it myself and the speed is not bad. Is there a place where I can read more about how rank features are implemented? E.g. what kind of data structure is used, how it skips non-competitive docs, etc.

About rank features, you can read more here

We also have a blog devoted to the topic of skipping non-competitive hits.

Further details on how the rank features field is implemented can be found in the Lucene code here
