As we gain more and more customers (which is good news), we have more and more values.
I'm trying to refactor our queries and I found the terms lookup mechanism, which looks very
interesting for our case. BUT the "lookup" values can be large: the current maximum is 118195 keys for one user.
So when I add the terms query, I get the following error (which I understand is expected):
Caused by: NotSerializableExceptionWrapper[too_many_clauses: maxClauseCount is set to 1024]
If I raise the configured limit to a large value it works, but is that a good solution? What problems could we run into?
Do any of you run with a high value for index.query.bool.max_clause_count, such as 100000?
Here is the "project":
Let's say we have products in an index, and imagine that customers can "subscribe" to products.
A customer can search his products, so I use the terms lookup mechanism as a join between products
and user selections. The problem is that some users have subscribed to more than 1024 products.
If it were a very simple query, I would run multiple queries with the subscriptions split into batches, but the existing query
is already so big and complex that this isn't possible.
NOTE: Our current solution is that the product index has a field with userid, but I'm looking for another architecture.
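To make the setup concrete, here is a minimal sketch of what the terms lookup query body and the "split subscriptions" idea could look like when built from Python. The index, type, path and field names (user_selections, selection, product_ids, product_id) are hypothetical and only stand in for the real mapping; 1024 is the default limit from the error message above.

```python
from typing import Any, Dict, Iterator, List

MAX_CLAUSES = 1024  # default maxClauseCount reported in the error above


def terms_lookup_query(user_id: str) -> Dict[str, Any]:
    """Build a terms query whose values are fetched from another document.

    All index, type, path and field names here are hypothetical.
    """
    return {
        "terms": {
            "product_id": {
                "index": "user_selections",
                "type": "selection",
                "id": user_id,
                "path": "product_ids",
            }
        }
    }


def split_into_batches(ids: List[int], size: int = MAX_CLAUSES) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# The largest user has 118195 subscriptions, so splitting gives 116 batches.
batches = list(split_into_batches(list(range(118195))))
print(len(batches))
```

With 116 batches for the largest user, splitting means 116 round trips of an already complex query, which is why it does not look practical here.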
Maybe you could use parent-child to maintain the relationship. Let the product document be the parent and create a child document per user that subscribed to this product. Add a child to subscribe and remove it to unsubscribe.
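As a rough sketch of how a search could then be restricted to one user's products, a has_child query can select the parent products that have a matching child. The child type name (subscription) and field name (user_id) are assumptions for illustration, not names from the actual mapping.

```python
from typing import Any, Dict


def subscribed_products_query(user_id: int) -> Dict[str, Any]:
    """Match product (parent) documents that have a subscription child for this user.

    The child type "subscription" and the field "user_id" are hypothetical.
    """
    return {
        "query": {
            "bool": {
                "filter": {
                    "has_child": {
                        "type": "subscription",
                        "query": {"term": {"user_id": user_id}},
                    }
                }
            }
        }
    }


body = subscribed_products_query(42)
```

Subscribing then means indexing one small child document, and unsubscribing means deleting it, without touching the product document itself.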
I have questions about this solution:
First, you need to know that we run a full reindex and roll the indices every weekend.
1°/ Would you create the child document per user in the product index or in another index? If you mean in the product index, that's the same solution we already have, and that's why we want to change (the full reindex takes too long because the query is very long).
2°/ Is it a problem to have a parent-child relationship between indices that are dropped and reindexed? (They are reached through a fixed alias, though.)
@Christian_Dahlqvist Do you have a recommendation for the maximum size of this field? They're Long ids in our case... And the biggest array is about 118K :-/
Reindexing is done every weekend (with a partial reindex at night) because some changes are not synchronized from our MySQL database. For old crappy reasons
Increase the number of terms using BooleanQuery.setMaxClauseCount(). Note that this will increase the memory requirements for searches that expand to many terms. To deactivate any limits, use BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE).
I would discourage increasing the maximum number of clauses; it tends to be a source of problems, as Lucene may need to read from {max_clause_count} locations on the disk in parallel.
In general, terms queries are only subject to the maximum clause count if their score is required. So I'm wondering whether you could work around your issue by just putting your terms query in a filter context, such as under a constant_score query or in a bool filter clause (assuming that you don't need scores for that query)?
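For illustration, the two filter-context variants mentioned above could be built like this. The field name product_id is an assumption, and whether this actually avoids the clause limit should be checked against the Elasticsearch version in use.

```python
import json
from typing import Any, Dict, Iterable


def constant_score_terms(field: str, values: Iterable[int]) -> Dict[str, Any]:
    """Wrap a terms query in constant_score so it runs as a non-scoring filter."""
    return {
        "query": {
            "constant_score": {
                "filter": {"terms": {field: list(values)}}
            }
        }
    }


def bool_filter_terms(field: str, values: Iterable[int]) -> Dict[str, Any]:
    """Put the same terms query in the filter clause of a bool query."""
    return {
        "query": {
            "bool": {
                "filter": [{"terms": {field: list(values)}}]
            }
        }
    }


# "product_id" is a hypothetical field name.
print(json.dumps(constant_score_terms("product_id", [1, 2, 3])))
```

The bool variant is the one to reach for when the terms filter has to be combined with the rest of an existing, larger query, since other clauses can sit next to it under must or should.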