Sparse vector vs rank features. Which one?

I am trying to implement a customized search function that will rank documents based on sparse features. Concretely, each document will have a list of sparse features, e.g.

doc_1 -> {a: 1.1, b: 0.2}
doc_2 -> {a: 0.2, z: 0.3, zz: 1.2} ...

Now at the query stage, the query will have a list of the sparse dimension that appears, e.g. [a, zz, b]. My scoring function is for each doc is simply added up all the value of terms that appear in both the query and the document. Take the above example,
doc_1_score = 0.2 + 1.2
doc_2_score = 1.1 + 0.2

My question is what is the most efficient way to implement this, I have two ideas now.

  1. using rank features to save all the values in documents, and then using a list of "should" to get the score
  2. save the features as sparse vectors, then using a dot_product to get the final score.

Which one do you think will be more efficient (memory & speed). Is there better way to accomplish this, e.g. using inverted index? Thank you!

Any one whom can help?

Hello there,
we have deprecated sparse_vector datatype and you should not use it anymore. We did not see a good adoption for it.

Using rank_features seems to be a good alternative if you don't have that many features in your query to have a reasonable boolean query.

Thanks for the reply! What is considered as reasonable for boolean query features? Usually I will have 20-100 features in the query. Is that considered okay? Thanks!

20-100 features sounds reasonable. There is a limit on the maximum number of clauses within a boolean query that should not be exceeded.

Cool! Thanks.

(Newbie here) do you think this type of Boolean query that I am Interested can scale and maintain high speed for large index? I have more than 10-50 million documents in the index.

10-50 million docs is not a very big collection, but the performance of a query depends on many factors: for a single clause how many docs contain a particular feature, how many clauses you have in total etc. A good thing with rank features query is that it can efficiently skip non-competitive documents if you just need top N docs and don't need total hit count.

So the best advice for you is to index your collection and test the performance of your queries yourself.

Thank you! I tested myself and the speed is not bad. Is there a place I can read more about how rank features are implemented? E.g what kind of data structure and how it skip non competitive docs etc.

@snakeztc
About rank features, you can read more here

We also have a blog devoted to the topic of skipping non-competitive hits.

Further details how rank features field is implemented can be found in Lucene code here

Hi I have implemented the rank_feature as your suggested and it works great in general. I am using log-based rank_feature with scaling factor = 1. I.e. y = log(x + 1)

one thing I notice that, the return score sometime is different from the actual correct score, if I manually compute it. In general the overall ranking is correct, however, there are cases where one doc should have slightly higher score, but it actually gets a lower score. For example, the return score and actual score rank list may look like:
(_score, my_score)
8.0079155 7.726183013850095
7.4573565 7.240700372005419
5.7250347 5.718247028450972
5.582849 5.3363395329703245
5.35503 5.120882132596499
5.3481665 5.350607821185878
5.231105 5.068864868915258
5.190113 5.007839149255898
4.9003787 4.8552734990413615

Do you have ideas does this happen? Is it because of the 9 bit precision or it's because of the skipping algorithm that does some approximation?

In general the overall ranking is correct

Have you experienced a case where the ranking was incorrect? It looks from the result you submitted that the ranking is incorrect. We would appreciate if you share a reproducible test case that results in wrong ranking.

Do you have ideas does this happen? Is it because of the 9 bit precision or it's because of the skipping algorithm that does some approximation?

Skipping algorithm should not be at blame here. The approximation happens because of 9 bit precision and converting Math.log result to a float score.

Yes. There are cases where the ranking is incorrect as I shown from the example above. My index is pretty big and what's the best way for me to share a reproducible test?

@snakeztc
The best way would be to create an issue in https://github.com/elastic/elasticsearch.
rank_feature query scores don't depend on other documents, so showing just these few documents that produce the wrong ranking would be very helpful. Thank you