Rank_feature takes a lot of space

I am using rank features for a customized ranking function. I used rank_features as the data type, and a document usually contains about 5000 features out of 30K possible keys. It's taking way too much space though.

For example, an index with 2 million documents has:
2M x 5000 float numbers x 4 bytes ~= 40G. However, this index is actually taking 290G disk space. Can someone help me with what's wrong? (note, there is no replica here)
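
The back-of-the-envelope estimate above can be reproduced with a quick sketch. It assumes decimal units (1G = 10**9 bytes) and 4 bytes per value, as in the question; the variable names are only illustrative.

```python
# Figures taken from the question; decimal gigabytes are an assumption.
num_docs = 2_000_000
features_per_doc = 5_000
bytes_per_float = 4

# Raw payload if every feature value were stored as a bare 4-byte float.
raw_bytes = num_docs * features_per_doc * bytes_per_float
raw_gb = raw_bytes / 10**9

observed_gb = 290  # on-disk index size reported in the question
overhead_factor = observed_gb / raw_gb

print(f"raw float payload: {raw_gb:.0f} GB")
print(f"observed / raw:    {overhead_factor:.2f}x")
```

The estimate counts only the values themselves, not the feature keys, the stored original document, or any per-index data structures.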

I attached the mapping:
"properties": {
  "term_scores": {
    "type": "rank_features"
  }
}

An example term_scores looks like:
{"term_scores": {
  "1": 1.2,
  "23": 4.2,
  .... // about 5000 values from a vocabulary size of 30K
}}

Hi @snakeztc,

It's great to hear you are taking an interest in the "rank_features" field. I'm not sure I understand how you measure the impact of the field on the index size on disk. Note, for example, that by default the index stores the original JSON string of each document in the "_source" field, and there are also auxiliary data structures to consider. Doing a rough calculation, with 290G each of the 2M docs clocks in at around 156 KB (or ~32 bytes per feature), which is not too bad IMHO. You can try disabling "_source" just to take out some of that impact, although it might be impractical in the long run depending on your use case.
Also, 5000 features in one field sounds like quite a lot, including later on the query side. Can you elaborate a bit on your use case and query patterns so we can understand it better?
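
The rough per-document figures and the "_source" suggestion above can be sketched as follows. This is a minimal illustration, not a measurement: it assumes "290G" means GiB, uses a synthetic document (made-up keys and a constant value of 1.5) only to show how large the JSON text of 5000 features is compared with 4 bytes per value, and builds a hypothetical index-creation body with "_source" disabled.

```python
import json

# Figures from the thread; treating "290G" as GiB is an assumption.
num_docs = 2_000_000
features_per_doc = 5_000
index_bytes = 290 * 1024**3

bytes_per_doc = index_bytes / num_docs
bytes_per_feature = bytes_per_doc / features_per_doc
print(f"per document: {bytes_per_doc / 1000:.0f} KB")
print(f"per feature:  {bytes_per_feature:.1f} bytes")

# Synthetic document: 5000 keys drawn evenly from a 30K vocabulary,
# all with the same placeholder value. Not real data.
synthetic_doc = {"term_scores": {str(k): 1.5 for k in range(0, 30_000, 6)}}
source_bytes = len(json.dumps(synthetic_doc).encode("utf-8"))
print(f"JSON size of synthetic doc: {source_bytes} bytes")
print(f"bare 4-byte floats:         {features_per_doc * 4} bytes")

# Hypothetical index-creation body with _source disabled.
# Without _source, features that need the original document
# (for example reindexing or partial updates) are not available.
index_body = {
    "mappings": {
        "_source": {"enabled": False},
        "properties": {"term_scores": {"type": "rank_features"}},
    }
}
print(json.dumps(index_body, indent=2))
```

Comparing the index size with and without "_source" on a small sample of documents would show how much of the 290G comes from the stored JSON rather than from the rank features themselves.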
