TooManyClauses on Bool query + fuzzy like this field


(Matheus Salvia) #1

Hello all, I'm trying to execute the following query:
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "fuzzy_like_this_field": {
                        field_name: {
                            "like_text": text,
                            "min_similarity": 0.95,
                            "max_query_terms": 10000
                        }
                    }
                },
                {
                    "has_parent": {
                        "type": "user",
                        "query": {
                            "match": {
                                "_id": parent_id
                            }
                        }
                    }
                }
            ]
        }
    }
}

I run it as a check before indexing, to avoid indexing texts that are
duplicates of (or very similar to) ones already indexed for a particular
user, but sometimes I get a TooManyClauses exception. Does anyone know why?
Maybe max_query_terms is too high? By the way, what does max_query_terms
do *exactly*? The documentation isn't very clear on that. The text field
can be very long (it's the text of some HTML pages, without the markup),
and I need to run a fuzzy match to avoid duplication, as duplicates can
bloat my index very easily, and some pages can change slightly (that's why
the 0.95 similarity) without changing the semantics.

Any help is appreciated.

The stacktrace I'm receiving is:

ElasticHttpError: (500, u'SearchPhaseExecutionException[Failed to execute
phase [query], all shards failed; shardFailures
{[ofOw22icTAycLqH0PIsXug][users][2]:
RemoteTransportException[[Robin][inet[/10.185.31.90:9300]][search/phase/query]];
nested: QueryPhaseExecutionException[[users][2]: query[filtered(+null
+parent_filter[user](filtered(ConstantScore(_uid:bullmerang#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:bullmerang_0184ef0c#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abc#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_s1mbtest#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:user#5f9e10a8-7b3c-4468-9d8a-e8e788debd62))->cache(_type:user)))->cache(_type:mouseflow_dc67f533)],from[0],size[10]:
Query Failed [Failed to execute main query]]; nested:
RuntimeException[org.apache.lucene.search.BooleanQuery$TooManyClauses:
maxClauseCount is set to 1024]; nested: TooManyClauses[maxClauseCount is
set to 1024]; }{[FQ28Q01yTBaDxyAgTFcidQ][users][3]:
QueryPhaseExecutionException[[users][3]: query[filtered(+null
+parent_filter[user](filtered(ConstantScore(_uid:bullmerang#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:bullmerang_0184ef0c#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abc#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_s1mbtest#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:user#5f9e10a8-7b3c-4468-9d8a-e8e788debd62))->cache(_type:user)))->cache(_type:mouseflow_dc67f533)],from[0],size[10]:
Query Failed [Failed to execute main query]]; nested:
RuntimeException[org.apache.lucene.search.BooleanQuery$TooManyClauses:
maxClauseCount is set to 1024]; nested: TooManyClauses[maxClauseCount is
set to 1024]; }{[ofOw22icTAycLqH0PIsXug][users][0]:
RemoteTransportException[[Robin][inet[/10.185.31.90:9300]][search/phase/query]];
nested: QueryPhaseExecutionException[[users][0]: query[filtered(+null
+parent_filter[user](filtered(ConstantScore(_uid:bullmerang#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:bullmerang_0184ef0c#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abc#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_s1mbtest#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:user#5f9e10a8-7b3c-4468-9d8a-e8e788debd62))->cache(_type:user)))->cache(_type:mouseflow_dc67f533)],from[0],size[10]:
Query Failed [Failed to execute main query]]; nested:
RuntimeException[org.apache.lucene.search.BooleanQuery$TooManyClauses:
maxClauseCount is set to 1024]; nested: TooManyClauses[maxClauseCount is
set to 1024]; }{[ofOw22icTAycLqH0PIsXug][users][1]:
RemoteTransportException[[Robin][inet[/10.185.31.90:9300]][search/phase/query]];
nested: QueryPhaseExecutionException[[users][1]: query[filtered(+null
+parent_filter[user](filtered(ConstantScore(_uid:bullmerang#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:bullmerang_0184ef0c#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abc#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_s1mbtest#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:user#5f9e10a8-7b3c-4468-9d8a-e8e788debd62))->cache(_type:user)))->cache(_type:mouseflow_dc67f533)],from[0],size[10]:
Query Failed [Failed to execute main query]]; nested:
RuntimeException[org.apache.lucene.search.BooleanQuery$TooManyClauses:
maxClauseCount is set to 1024]; nested: TooManyClauses[maxClauseCount is
set to 1024]; }{[FQ28Q01yTBaDxyAgTFcidQ][users][4]:
QueryPhaseExecutionException[[users][4]: query[filtered(+null
+parent_filter[user](filtered(ConstantScore(_uid:bullmerang#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:bullmerang_0184ef0c#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:mouseflow_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abc#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_abcd#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_dc67f533#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:requests_s1mbtest#5f9e10a8-7b3c-4468-9d8a-e8e788debd62
_uid:user#5f9e10a8-7b3c-4468-9d8a-e8e788debd62))->cache(_type:user)))->cache(_type:mouseflow_dc67f533)],from[0],size[10]:
Query Failed [Failed to execute main query]]; nested:
RuntimeException[org.apache.lucene.search.BooleanQuery$TooManyClauses:
maxClauseCount is set to 1024]; nested: TooManyClauses[maxClauseCount is
set to 1024]; }]')

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAOJV22Z1Fwdm_3BXbg5hezro9HK_CVOX7jaaTGyK%2BVVHw4H72w%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Binh Ly) #2

Matheus,

Yes, max_query_terms can cause the TooManyClauses exception. What the
fuzzy_like_this query does "approximately" is take your text, break it
down into terms, and construct a big boolean query consisting of a lot of
OR clauses. So let's say your text is "Hello World". It rewrites that
"roughly" into "Hello OR World OR VariantsOfHello OR VariantsOfWorld". The
max_query_terms parameter caps how many of these terms (and variants) make
it into the big boolean query. For boolean queries, Lucene by default caps
the number of clauses at 1024. If absolutely needed, there is a way to
increase this number in ES, but it has to be set in the config file before
starting up ES.
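
To make the expansion described above concrete, here is a toy model in Python. It is only a sketch and not Lucene's actual FuzzyLikeThisQuery code: the function names, the `TooManyClauses` stand-in class, and the way variants are generated are all made up for illustration. The only number taken from this thread is the default limit of 1024 clauses.

```python
# Toy model only: this is NOT how Lucene implements fuzzy_like_this.
MAX_CLAUSE_COUNT = 1024  # default limit quoted in the error message above


class TooManyClauses(Exception):
    """Stand-in for org.apache.lucene.search.BooleanQuery$TooManyClauses."""


def expand_terms(text, variants_per_term, max_query_terms):
    """Return the OR clauses for `text`, capped at `max_query_terms`."""
    clauses = []
    # Unique lowercase terms, in order of first appearance.
    for term in dict.fromkeys(text.lower().split()):
        clauses.append(term)
        # Hypothetical variants; real ones come from terms present in the index.
        clauses.extend("%s~%d" % (term, i) for i in range(variants_per_term))
    return clauses[:max_query_terms]


def build_boolean_query(clauses, max_clause_count=MAX_CLAUSE_COUNT):
    """Join clauses with OR, refusing to exceed the clause limit."""
    if len(clauses) > max_clause_count:
        raise TooManyClauses("maxClauseCount is set to %d" % max_clause_count)
    return " OR ".join(clauses)
```

In this model a short text never gets near the limit, which would explain why the exception only shows up "sometimes": a long page with max_query_terms set to 10000 can produce more than 1024 clauses, while the same page with a cap below 1024 cannot.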



(Matheus Salvia) #3

Thank you Binh, I reduced max_query_terms to 1000 and the exception
doesn't show up anymore. One last question: from what I've understood of
your answer, the higher I set this value, the more precise the fuzzy match
is going to be, right? But since the default value is 25, do you think 1000
is enough to match texts with 0.95 similarity? (I'm aiming to find texts at
least 95% similar to the one I have.)
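
For anyone landing here later, the change described above could be wrapped up roughly like this. It is a minimal sketch: `build_dedup_query` and `SAFE_MAX_QUERY_TERMS` are names invented for this example, and 1000 is simply the value reported to work in this thread, not a guaranteed safe bound for every index or text.

```python
# Value that stopped the exception in this thread (below Lucene's default of 1024).
SAFE_MAX_QUERY_TERMS = 1000


def build_dedup_query(field_name, text, parent_id, max_query_terms=SAFE_MAX_QUERY_TERMS):
    """Build the duplicate-check query from post #1 with a bounded max_query_terms."""
    # Assumption: keeping the term cap under the clause limit avoids TooManyClauses.
    capped_terms = min(max_query_terms, SAFE_MAX_QUERY_TERMS)
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "fuzzy_like_this_field": {
                            field_name: {
                                "like_text": text,
                                "min_similarity": 0.95,
                                "max_query_terms": capped_terms,
                            }
                        }
                    },
                    {
                        "has_parent": {
                            "type": "user",
                            "query": {"match": {"_id": parent_id}},
                        }
                    },
                ]
            }
        }
    }
```

Calling `build_dedup_query("body", page_text, user_id, max_query_terms=10000)` would still send 1000, so a caller cannot accidentally reintroduce the original value.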


