Help! How can I get the docFreq in Elasticsearch?


(minghua.cao) #1

How can I get the docFreq in Elasticsearch? I found that TermsEnum has a member function docFreq(). How can I get the docFreq statistic by passing in a term as a String?

I'd like a function like this:
long getDocFreq(String term){ return x.docFreq();}
Ideally the docFreq would then be cached automatically by Elasticsearch.

I need this because I want to return this number in my suggest results.


(Nik Everett) #2

It's an enum that has to be positioned at the term before reading the docFreq. Think of it as the read head of a hard drive. There are like 12397 layers of caching and abstraction between it and the actual read head, but I think it's an OK way to think of it.

Anyway, it's generally better to fetch the docFreq when you've already positioned the terms enum. The suggesters very likely already do that, so you should just find it there. But to answer your question, this is how you'd do it from scratch:

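// Reused across calls to avoid allocating per lookup; this is why the method is not thread safe.
BytesRef spare = new BytesRef();

// Assumes termsEnum is a TermsEnum already obtained for the field you care about.
public int getDocFreq(String term) throws IOException {
  spare.copyChars(term);  // convert the UTF-16 Java string to UTF-8 bytes
  if (termsEnum.seekExact(spare)) {
    return termsEnum.docFreq();
  }
  return 0;  // term not found in this shard
}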

Note that this whole thing is not thread safe. The spare is involved because Lucene stores everything as UTF-8 rather than UTF-16 like Java strings, so terms have to be converted. Part of the Lucene philosophy is not to allocate memory for things like this, so spare will only grow when needed. Otherwise it just holds the UTF-8 bytes of the last term you looked up.
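To make the role of spare concrete, here is a minimal sketch of a reusable UTF-8 buffer written with only the JDK. Utf8Spare is a hypothetical stand-in for illustration and is not Lucene's BytesRef; it only shows the idea of converting a Java string to UTF-8 into a buffer that grows when needed and is otherwise reused.

```java
/** Hypothetical stand-in for Lucene's BytesRef; not the real class. */
class Utf8Spare {
    byte[] bytes = new byte[16]; // reused between calls
    int length = 0;              // number of valid bytes in the buffer

    /** Encodes term as UTF-8 into the reused buffer, growing it only when too small. */
    void copyChars(String term) {
        int worstCase = term.length() * 3; // at most 3 bytes per UTF-16 char
        if (bytes.length < worstCase) {
            bytes = new byte[worstCase];
        }
        int n = 0;
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {                       // 1 byte
                bytes[n++] = (byte) cp;
            } else if (cp < 0x800) {               // 2 bytes
                bytes[n++] = (byte) (0xC0 | (cp >> 6));
                bytes[n++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {             // 3 bytes
                bytes[n++] = (byte) (0xE0 | (cp >> 12));
                bytes[n++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                bytes[n++] = (byte) (0x80 | (cp & 0x3F));
            } else {                               // 4 bytes (surrogate pair in UTF-16)
                bytes[n++] = (byte) (0xF0 | (cp >> 18));
                bytes[n++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                bytes[n++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                bytes[n++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        length = n;
    }

    public static void main(String[] args) {
        Utf8Spare spare = new Utf8Spare();
        spare.copyChars("h\u00e9llo");
        System.out.println(spare.length); // 6 bytes: the accented e takes two
    }
}
```

Because the same byte array is overwritten on every call, two threads sharing one instance would corrupt each other's lookups, which is the thread-safety caveat above.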

You may want to send a PR when you are done. Keep in mind that docFreq on the TermsEnum is per shard and it includes deleted docs, so it's not an exact number.

The return type will actually be an int because Lucene indexes can only hold up to Integer.MAX_VALUE documents at a time. The whole Elasticsearch index can hold more because it has more than one shard, but suggestions are performed on per-Lucene-index (that is, per-shard) numbers.
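If you did want a number for the whole Elasticsearch index, you would have to add up the per-shard values yourself. A minimal sketch, assuming you have already collected one int docFreq per shard (how you collect them is not shown, and the class and method names here are hypothetical):

```java
/** Hypothetical helper: combines per-shard docFreq values into one total. */
class ShardDocFreq {
    /** Sums per-shard ints into a long so the total cannot overflow an int. */
    static long totalDocFreq(int[] perShardDocFreq) {
        long total = 0L;
        for (int df : perShardDocFreq) {
            total += df;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up values for illustration: two full shards plus a small one.
        int[] perShard = {Integer.MAX_VALUE, Integer.MAX_VALUE, 5};
        System.out.println(totalDocFreq(perShard)); // 4294967299
    }
}
```

The sum is still only an upper bound on live documents, since each per-shard value includes deleted docs.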


(Nik Everett) #3

Also, like a read head, it's better to read forward than backward. The terms are stored in sorted order, I believe sorted by their UTF-8 byte representation.
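If you have several terms to look up, one way to keep the enum moving forward is to sort them the same way first. This is a sketch that assumes the terms really are ordered by unsigned UTF-8 bytes, as I believe they are; the class and method names are made up for illustration. Note that this order can differ from String.compareTo, which compares UTF-16 code units.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper: orders lookup terms by their unsigned UTF-8 bytes. */
class Utf8Order {
    static final Comparator<String> UTF8_BYTE_ORDER = (a, b) ->
        Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8),
            b.getBytes(StandardCharsets.UTF_8));

    /** Returns a sorted copy so that seeks can proceed front to back. */
    static String[] sortForForwardSeeks(String[] terms) {
        String[] sorted = terms.clone();
        Arrays.sort(sorted, UTF8_BYTE_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // U+1F600 (a surrogate pair in UTF-16) and U+FF5E swap places
        // depending on whether you compare UTF-16 units or UTF-8 bytes.
        String[] terms = {"\uD83D\uDE00", "\uFF5E", "b", "a"};
        String[] sorted = sortForForwardSeeks(terms);
        for (String t : sorted) {
            System.out.println(t);
        }
    }
}
```

Arrays.compareUnsigned needs Java 9 or later. Encoding each string inside the comparator allocates on every comparison, so for large batches you would probably encode once up front.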


(minghua.cao) #4

First, thank you very much for your help!
My requirement is to return the doc freq of each suggested word. The number does not need to be exact, so the lack of thread safety is acceptable.
According to your answer, this docFreq covers only one shard and includes deleted docs, so it may not meet my requirement.
What I need is a doc freq for the term that excludes deleted docs, and I can maintain it in a cache like Redis or Memcached.
My current solution is to compute it before indexing and store it in Memcached, which is not convenient to update.
So what I want to know is whether Elasticsearch stores this docFreq and whether there is an API to get it. If not, how can I maintain it at index time?

