Why node query cache works only for filters


(Vadim Gindin) #1

Hi, I have a custom query with an ML scoring algorithm. I'd like to be able to cache search results before different aggregations/sorting are applied, i.e. request_cache is not suitable. The node query cache would be great, but it only works for queries in a filter context. Why?

Could you advise a caching solution for my custom query? How is it possible to implement a custom cache in an Elasticsearch plugin?

Regards,
Vadim Gindin


(Adrien Grand) #2

Hi Vadim,

Aggregations/sorting don't happen after results are computed but at the same time, via a MultiCollector that has multiple sub-collectors: one that computes top hits, another that computes aggregations. What kind of data are you willing to cache exactly?
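
To sketch the idea (plain Java; `Collector` here is a stand-in for Lucene's LeafCollector, not the real interface):

```java
import java.util.List;

// Sketch of the MultiCollector idea: a single pass over the matching docs
// feeds every sub-collector at once, so "top hits" and "aggregations" are
// computed together rather than one after the other.
public class MultiCollectorSketch {
    /** Stand-in for Lucene's LeafCollector. */
    public interface Collector {
        void collect(int doc, float score);
    }

    /** Returns a collector that forwards each (doc, score) to all sub-collectors. */
    public static Collector wrap(List<Collector> subs) {
        return (doc, score) -> {
            for (Collector c : subs) {
                c.collect(doc, score);
            }
        };
    }
}
```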

The query cache only caches non-scoring queries (i.e. filters) because caching scores would add a lot of memory overhead. We typically need between 1 bit and 2 bytes per cached doc ID, while caching scores would require at least 4 additional bytes.
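
To put rough numbers on that (a back-of-envelope sketch only; the real query cache uses denser encodings such as roaring bitmaps for sparse result sets):

```java
// Illustrates the memory trade-off: a non-scoring filter fits in a bitset
// (1 bit per doc), while cached scores add at least a 4-byte float per doc.
public class CacheCost {
    // A non-scoring filter can be cached as a bitset: 1 bit per doc.
    static long bitsetBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    // Caching scores adds at least one 4-byte float per matching doc on top.
    static long scoreOverheadBytes(long matchingDocs) {
        return matchingDocs * Float.BYTES;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L;
        System.out.println("bitset: " + bitsetBytes(maxDoc) + " bytes");        // ~1.25 MB
        System.out.println("scores: " + scoreOverheadBytes(maxDoc) + " bytes"); // ~40 MB if all docs match
    }
}
```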


(Vadim Gindin) #3

Hi Adrien!

In the current query implementation, the score calculation (using an ML algorithm) is more expensive than matching and match processing in the scorer. So I'd like to cache the computed document scores by doc ID.

I'm probably ready to accept the additional memory consumption. Is there a possibility to use the node query cache for that somehow?

--

My current implementation uses a BulkScorer to score all matched docs at once. But I've found that bulkScorer() is not called in some cases: when the query is wrapped in a BooleanQuery and there is more than one required clause (see BooleanWeight.booleanScorer()). So it doesn't look like a universal way to use an ML scoring algorithm. Is there a way to get around this limit?

Thanks!


(Adrien Grand) #4

Hi Vadim,

If the bulk scorer helps because computing multiple scores at once is easier, then you could consider using something like Lucene's BulkScorerWrapperScorer: https://github.com/apache/lucene-solr/blob/910a0231f6fc668426056e31d43e293248ff5ce1/lucene/test-framework/src/java/org/apache/lucene/search/BulkScorerWrapperScorer.java. It was initially designed for testing, but it might be helpful in your case as well.

If you need a more general caching mechanism, I don't think reusing an existing cache is an option. I guess the easier way to do it would be to fold it directly into your query (since you already seem to have a custom plugin for a query). Make sure to register closed listeners so that the JVM can reclaim memory for segments that have been merged away.
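
A minimal sketch of that pattern (plain Java; in Lucene you would key entries by `LeafReader#getCoreCacheHelper().getKey()` and register the eviction via `CacheHelper#addClosedListener` — `SegmentHandle` and `ScoreCache` below are hypothetical stand-ins, not real Lucene types):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Per-segment score cache that frees its entries when a segment is closed
// (e.g. merged away), which is the point of registering closed listeners.
public class ScoreCache {
    // One scores array per segment, keyed by the segment's cache key.
    private final Map<Object, float[]> scoresBySegment = new ConcurrentHashMap<>();

    /** Called from Weight#scorer for a given leaf: reuse cached scores or compute them. */
    public float[] getOrCompute(SegmentHandle segment, Supplier<float[]> compute) {
        return scoresBySegment.computeIfAbsent(segment.cacheKey(), key -> {
            // Evict this entry when the segment goes away, so the JVM can
            // reclaim the memory.
            segment.addClosedListener(k -> scoresBySegment.remove(k));
            return compute.get();
        });
    }

    public int size() {
        return scoresBySegment.size();
    }

    /** Stand-in for LeafReader + IndexReader.CacheHelper. */
    public interface SegmentHandle {
        Object cacheKey();                               // analog of CacheHelper#getKey
        void addClosedListener(Consumer<Object> onClose); // analog of CacheHelper#addClosedListener
    }
}
```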


(Vadim Gindin) #5

Hi Adrien!

Could you advise me (or share a link) on how to implement closed listeners correctly? From what I've found, I should register my own AbstractLifecycleComponent and add a LifecycleListener there that clears the cache in its beforeStop() method. Is that correct?

P.S. I have the other difficulty with BulkScorer:

  1. If my index has nested fields, then org.elasticsearch.search.DefaultSearchContext wraps my query in a BooleanQuery and adds an Occur.FILTER clause to it (which in fact becomes DocValuesFieldExistsQuery [field=_primary_term]). Further, when executing, the BooleanWeight.booleanScorer() method prevents my bulkScorer from being executed, because the final query has 2 required clauses (have a look there).
  2. I'd also like to be able to wrap my query in a BooleanQuery myself.

So, in these cases (when my query is wrapped in a BooleanQuery) my bulkScorer is not called at all. Therefore my search does not work at all (not only the caching).

Is there a way to overcome this limitation somehow?

Thanks,
Vadim Gindin


(Adrien Grand) #6

Hi Vadim,

Actually no, you could do it at the level of your query directly. I guess the closest example we have to that is the query cache; see e.g. the calls to addClosedListener at https://github.com/apache/lucene-solr/blob/b4449c73e4c1ed34bc155ae5a818ac1a870ea7f8/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java.

Indeed, Elasticsearch adds filters implicitly. BulkScorer is supposed to be an optimization rather than the regular API for queries to implement. That said, if this API is more convenient for you to implement, you could make the Weight#scorer method return something like the BulkScorerWrapperScorer wrapper class I linked above, which wraps a BulkScorer and implements the Scorer API.
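
The adapter idea looks roughly like this (plain Java, with `BulkPass`, `DocScoreConsumer` and `BufferedScorer` as stand-ins for the real Lucene types; the real BulkScorerWrapperScorer buffers in fixed-size windows rather than all at once):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the adapter: score a whole batch of documents in one bulk pass
// (where an ML model can batch its work), buffer the (doc, score) pairs,
// then replay them through a doc-at-a-time nextDoc()/score() API like the
// one Lucene's Scorer exposes.
public class BufferedScorer {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Stand-in for BulkScorer#score: pushes (doc, score) pairs to a callback. */
    public interface BulkPass {
        void score(DocScoreConsumer out);
    }

    public interface DocScoreConsumer {
        void collect(int doc, float score);
    }

    private final List<Integer> docs = new ArrayList<>();
    private final List<Float> scores = new ArrayList<>();
    private int upto = -1;

    public BufferedScorer(BulkPass bulk) {
        // Run the bulk pass once up front and buffer everything it emits.
        bulk.score((doc, score) -> {
            docs.add(doc);
            scores.add(score);
        });
    }

    public int docID() {
        if (upto < 0) return -1;
        return upto >= docs.size() ? NO_MORE_DOCS : docs.get(upto);
    }

    public int nextDoc() {
        upto++;
        return docID();
    }

    public float score() {
        return scores.get(upto);
    }
}
```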