Sparse vector vs rank features. Which one?

I am trying to implement a customized search function that will rank documents based on sparse features. Concretely, each document will have a list of sparse features, e.g.

doc_1 -> {a: 1.1, b: 0.2}
doc_2 -> {a: 0.2, z: 0.3, zz: 1.2} ...

Now at the query stage, the query will have a list of the sparse dimensions that appear, e.g. [a, zz, b]. My scoring function for each doc simply adds up the values of all terms that appear in both the query and the document. Taking the above example:
doc_1_score = 1.1 + 0.2
doc_2_score = 0.2 + 1.2
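
For reference, the scoring rule described above can be sketched in plain Python. This is only an illustration of the intended arithmetic, not how Elasticsearch computes it; the function name `overlap_score` is made up for this example.

```python
from typing import Dict, Iterable


def overlap_score(doc_features: Dict[str, float], query_terms: Iterable[str]) -> float:
    """Sum the document's values for every feature that also appears in the query."""
    # dict.fromkeys() drops duplicate query terms while keeping their order.
    unique_terms = dict.fromkeys(query_terms)
    return sum(doc_features.get(term, 0.0) for term in unique_terms)


# The two example documents and the example query from the post.
docs = {
    "doc_1": {"a": 1.1, "b": 0.2},
    "doc_2": {"a": 0.2, "z": 0.3, "zz": 1.2},
}
query = ["a", "zz", "b"]

scores = {doc_id: overlap_score(features, query) for doc_id, features in docs.items()}
```

Features that are in the document but not in the query (such as z in doc_2) contribute nothing to the score.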

My question is: what is the most efficient way to implement this? I have two ideas so far.

  1. Use rank features to store all the values in the documents, and then use a list of "should" clauses to get the score.
  2. Store the features as sparse vectors, and then use a dot_product to get the final score.

Which one do you think will be more efficient (memory & speed)? Is there a better way to accomplish this, e.g. using an inverted index? Thank you!

Can anyone help?

Hello there,
We have deprecated the sparse_vector datatype and you should not use it anymore. We did not see good adoption for it.

Using rank_features seems to be a good alternative, as long as the number of features in your query is small enough to keep the boolean query reasonable.
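
A rough sketch of what this could look like, written as Python dicts that would be sent as the index mapping and the search body. The field name `features` is an assumption made for this example. The sketch also assumes a version of Elasticsearch where the rank_feature query supports the `linear` function, so that each clause contributes the indexed value itself; the default `saturation` function would not give a plain sum. Note that rank_features values must be positive and are stored with limited precision, so the scores are approximate.

```python
import json
from typing import Any, Dict, Iterable

FIELD = "features"  # hypothetical field name, not from the original post


def build_mapping(field: str = FIELD) -> Dict[str, Any]:
    """Index mapping with a single rank_features field holding the sparse features."""
    return {"mappings": {"properties": {field: {"type": "rank_features"}}}}


def build_query(query_terms: Iterable[str], field: str = FIELD) -> Dict[str, Any]:
    """Bool query with one 'should' rank_feature clause per distinct query feature."""
    should = [
        # 'linear' asks for the indexed feature value to be used as the clause score.
        {"rank_feature": {"field": f"{field}.{term}", "linear": {}}}
        for term in dict.fromkeys(query_terms)  # drop duplicate terms, keep order
    ]
    return {"query": {"bool": {"should": should}}}


# Example document body and search body for the query [a, zz, b].
doc_1 = {FIELD: {"a": 1.1, "b": 0.2}}
search_body = build_query(["a", "zz", "b"])

if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(search_body, indent=2))
```

The scores of matching "should" clauses are added together, which is what gives the sum over the features shared by the query and the document.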

Thanks for the reply! What is considered a reasonable number of features for a boolean query? Usually I will have 20-100 features in the query. Is that okay? Thanks!

20-100 features sounds reasonable. There is a limit on the maximum number of clauses within a boolean query that should not be exceeded.

Cool! Thanks.

(Newbie here) Do you think the type of boolean query I am interested in can scale and stay fast on a large index? I have 10-50 million documents in the index.

10-50 million docs is not a very big collection, but the performance of a query depends on many factors: how many docs contain the feature in each clause, how many clauses you have in total, etc. A good thing about the rank feature query is that it can efficiently skip non-competitive documents if you just need the top N docs and don't need the total hit count.
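
As a small sketch of the "top N without total hit count" setup: the search body can set `size` to N and turn off `track_total_hits`. The helper name `top_n_body` and the field name `features` are assumptions made for this example.

```python
from typing import Any, Dict


def top_n_body(query: Dict[str, Any], n: int = 10) -> Dict[str, Any]:
    """Wrap a query so that only the top n hits are returned and hits are not counted."""
    return {
        "size": n,                  # only the n best-scoring docs are needed
        "track_total_hits": False,  # exact total count is not required
        "query": query,
    }


# Hypothetical single-clause query on a rank_features field named 'features'.
example_query = {"rank_feature": {"field": "features.a"}}
body = top_n_body(example_query, n=20)
```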

So the best advice for you is to index your collection and test the performance of your queries yourself.

Thank you! I tested it myself and the speed is not bad. Is there a place where I can read more about how rank features are implemented? E.g. what kind of data structure is used, how it skips non-competitive docs, etc.

@snakeztc
About rank features, you can read more here

We also have a blog post devoted to the topic of skipping non-competitive hits.

Further details on how the rank features field is implemented can be found in the Lucene code here
