Stateful or cached search

Hi all, I am wondering whether it is possible to store (cache) a result set under some name or ID and use it in further queries, combining it with new constraints. We use only the filter context and the index is relatively fixed (it stays untouched for 24h), but the total document count is > 250m and partial result sets can contain up to 10m hits.
I would be very happy to receive information and suggestions for solving it. Thank you. :grinning:

Hey,

Apache Lucene automatically creates some reusable data structures (bitsets used for filtering) in the background once a certain filter has been used several times. See https://www.elastic.co/guide/en/elasticsearch/reference/7.9/query-cache.html

You can also cache complete search results (however this applies to a full search including all filters, so unless you know complete searches repeat, that might not be what you are after), see https://www.elastic.co/guide/en/elasticsearch/reference/7.9/shard-request-cache.html
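To illustrate, here is a minimal Python sketch of a request body in which every clause runs in filter context, which is the kind of query both caches apply to. The field names `industry` and `employees` and the helper `build_filter_search` are made up for the example; only the `bool`/`filter` structure is actual query DSL.

```python
import json
from typing import Dict, Iterable


def build_filter_search(filters: Iterable[Dict], size: int = 0) -> Dict:
    """Build a search body whose clauses all run in filter context (no scoring)."""
    return {
        "size": size,  # size=0 returns only counts/aggregations, no hits
        "query": {"bool": {"filter": list(filters)}},
    }


body = build_filter_search([
    {"term": {"industry": "bakery"}},      # hypothetical field
    {"range": {"employees": {"gt": 50}}},  # hypothetical field and threshold
])
print(json.dumps(body, indent=2))
```

As far as I know, the shard request cache only stores `size=0` requests by default; for other requests it has to be asked for explicitly with the `request_cache=true` query-string parameter described on the linked page.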

If you explain a little bit more about your use-case including sample documents and queries, maybe the index or the queries can be optimized as well for such a use-case. Are there certain queries that do not match your SLAs at all for example?

Hi Alexander, thank you for your help. I had written a long text but replaced it at the last moment with the short question in order to save readers' time. :slight_smile:

The main use case is building a result set as an intersection of partial queries, using data sets instead of a boolean formula. For example: a user searches for bakeries in several counties having a certain size (number of employees > x or sales revenue > y). He exports the hit list and uses it for a mailing action. After some weeks he wants to repeat the mailing action also (but not only) for smaller businesses, so he needs to adjust the search criteria, but wants to exclude the previous hits from the new result.
He could try to construct the complex boolean query by excluding the constraints from the previous search, but that is too complicated in a simple search form and there is a risk of excluding newly added businesses that satisfy the previous conditions. So he just wants to exclude the list of entries.

The current application „imports“ the old list (up to 10m hits) by querying the unique IDs in small packages (like Elasticsearch's terms filter with max. 64K terms) and „accumulates“ them in one big result set (yes, it is a huge bit vector). During this the app stays „responsive“ because the whole import process is divided into small parts. Afterwards this „stored result set“ is used in new queries (several times, because the adjustment happens by trial and error) like „and not in <old_file>“.
That is why I asked for named result sets. Some users use this feature extensively, combining one list with another while excluding a third... It seems to be simpler for people to operate with sets instead of logical formulas.
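To make the workflow concrete, here is a rough Python sketch of it, not the actual application code. It splits an exported ID list into packages, accumulates the matches into one stored set, and then applies „and not in <old_file>“ as a set difference. The field name `company_id`, the `resolve` callable (whatever runs one terms filter and returns the matching document numbers) and `fake_resolve` are all hypothetical; 65,536 is the default of Elasticsearch's `index.max_terms_count` setting.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Set

MAX_TERMS = 65_536  # default of the index.max_terms_count setting


def packages(ids: Sequence[str], size: int = MAX_TERMS) -> Iterator[List[str]]:
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def import_list(ids: Sequence[str],
                resolve: Callable[[Dict], Set[int]],
                size: int = MAX_TERMS) -> Set[int]:
    """Accumulate one result set, one package (one terms filter) at a time."""
    result: Set[int] = set()
    for pkg in packages(ids, size):
        result |= resolve({"terms": {"company_id": pkg}})  # hypothetical field
    return result


def fake_resolve(terms_filter: Dict) -> Set[int]:
    """Stand-in for a real search call: pretend each ID is its own doc number."""
    return {int(i) for i in terms_filter["terms"]["company_id"]}


# "Import" an old export of 100,000 IDs; this takes two packages.
old_ids = [str(i) for i in range(100_000)]
stored = {"old_mailing": import_list(old_ids, fake_resolve)}

# New search hits (here simply a range of doc numbers), minus the stored list.
new_hits = set(range(90_000, 120_000))
remaining = new_hits - stored["old_mailing"]
print(len(remaining))  # 20000
```

A Python set stands in for the bit vector here; union (`|`), intersection (`&`) and difference (`-`) cover the "one list with another excluding a third" combinations.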

Depending on your use-case and data modelling, couldn't a timestamp help, recording when a bakery was last contacted, so that it can be excluded from the result set for the next mailing?
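As a sketch of what I mean, assuming a hypothetical date field `last_contacted` on each document (the field name, the helper and the cutoff date are only for illustration):

```python
from typing import Dict, List


def exclude_contacted_since(filters: List[Dict], cutoff: str,
                            field: str = "last_contacted") -> Dict:
    """Keep the given filters, but drop documents contacted on or after `cutoff`."""
    return {
        "query": {
            "bool": {
                "filter": list(filters),
                # Documents without the field do not match the range,
                # so businesses never contacted stay in the result.
                "must_not": [{"range": {field: {"gte": cutoff}}}],
            }
        }
    }


body = exclude_contacted_since(
    [{"term": {"industry": "bakery"}}],  # hypothetical field
    "2020-09-01",                        # arbitrary example cutoff
)
```

This only works if the contact date is written back to the documents after each mailing.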

That means flagging the affected entries used in some lists, i.e. trying to solve the problem via the data for lack of program logic... We used to solve a lot of stuff this way earlier, such as writing „analyzed“ strings (unique lower-cased words, excluding stop words, and shifted) into extra search columns, but I hoped those days were over. :slight_smile:
There are a lot of exports, since we are not a single CRM system but rather a company database, and the ‚searchers‘ are our customers. We don't want to flag our data after every export that somebody has done. On the contrary: the exported list is up to the customer. He can reuse it, but doesn't have to.

I was hoping the bit vector could be used; I think it is used for scrolling. There is also a scroll ID, and I just wanted to use this ID as a kind of filter in the next query, because the search context remains unchanged.