I'm having a performance issue with my cluster (0.90.3). At ~3M docs per
shard and 5 shards per server, the servers can't handle much QPS (load
tests peg a single server at ~80 qps), even though the entire index sits in
memory [the docs are really small]. The nodes are burning a lot of CPU, and
it's not due to GC [1 CMS every 2 hours].
There isn't much documentation on the performance of most queries in ES. I
made a few guesses and wanted to know if I was right.
1. Prefer using _boost in lieu of custom_score. [I'm assuming ES can do
   more index-time optimizations this way.]
2. Merge filters at index time if possible.
3. Use mapping-level boosting per field instead of custom_boost_factor.
4. [In combination with 3] Index dis_max sub-queries to _all and get rid of
   the dis_max.
Of course these will sacrifice flexibility, but at this point I'm looking
for performance wins. Do any of these ideas have a basis, or is there
something else I can do to get the per-server performance up?
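To make the first and third ideas above concrete, here is a minimal sketch of the two request shapes, written as Python dicts in the 0.90-era query DSL. The type name `product`, the fields `title`, `body`, `popularity` and `doc_boost`, and the helper `to_indexed_doc` are all hypothetical; the real mapping and script were not posted, so treat this as an illustration rather than the actual setup.

```python
import json

# Query-time scoring: the script runs for every document the inner query matches.
# The field "popularity" is an assumed example, not taken from the thread.
custom_score_query = {
    "query": {
        "custom_score": {
            "query": {"match": {"title": "laptop"}},
            "script": "_score * doc['popularity'].value",
        }
    }
}

# Index-time alternative: a _boost field in the mapping (first idea) and a
# per-field boost on "title" (third idea). All names here are hypothetical.
mapping_with_boosts = {
    "product": {
        "_boost": {"name": "doc_boost", "null_value": 1.0},
        "properties": {
            "title": {"type": "string", "boost": 2.0},
            "body": {"type": "string"},
        },
    }
}


def to_indexed_doc(source, popularity):
    """Copy the document and store the boost value in the field named by _boost."""
    doc = dict(source)
    doc["doc_boost"] = popularity
    return doc


# With boosts applied at index time, the query itself needs no script.
plain_query = {"query": {"match": {"title": "laptop"}}}

if __name__ == "__main__":
    print(json.dumps(to_indexed_doc({"title": "laptop"}, 3.0)))
    print(json.dumps(plain_query))
```

The trade-off is the one mentioned above: changing a boost then means reindexing the document instead of editing a query.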
With scripting, you burn a lot of CPU. You are not quite right about the
index sitting in memory. It may fit in the filesystem cache, but your use
case is a different one: loading the index docs into the heap to compute
scores and boosts. The hot thread at least shows the script running through
all the docs and loading fields. Doc fetching is expensive. Scripts work
like this: they traverse all(!) hits of a query, fetch all docs, and load
the required fields into the heap, which takes a lot of resources. That is
why scripts are always second choice in my eyes.
Yes, boosting at document level when indexing is way more efficient.
The fewer filters you use, the faster the response is.
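The cost difference described here can be illustrated with a toy model. This is only a sketch in plain Python, not how Lucene or ES is implemented: `CountingFieldLoader` is a hypothetical stand-in for per-document field access, and the point is simply that a query-time script does one field load per hit, while a boost folded in at index time needs none.

```python
class CountingFieldLoader:
    """Hypothetical stand-in for per-document field access; counts every load."""

    def __init__(self, values):
        self.values = values
        self.loads = 0

    def load(self, doc_id):
        self.loads += 1
        return self.values[doc_id]


def script_scored(hits, loader):
    # Query-time script: visits every hit and loads a field value for each one.
    return {doc_id: score * loader.load(doc_id) for doc_id, score in hits.items()}


def index_time_boosted(hits):
    # The boost is already part of the stored score: no per-hit field load.
    return dict(hits)


# Assumed toy data: 100,000 docs, all of which match the query.
popularity = {i: 1.0 + (i % 5) for i in range(100_000)}
base_scores = {i: 1.0 for i in popularity}
loader = CountingFieldLoader(popularity)

scripted = script_scored(base_scores, loader)
prebaked = index_time_boosted({i: base_scores[i] * popularity[i] for i in popularity})

if __name__ == "__main__":
    print("field loads at query time:", loader.loads)
    print("same ranking scores:", scripted == prebaked)
```

In this model the per-query work of the scripted path grows with the number of hits, which matches the observation that the script runs through all matching docs.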
Thanks for the clarification, Jörg. Do you have any guidance on
dis_max([multiple fields]) vs _all with appropriate include_in_all in the
mapping?
Mike.
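For reference, the two shapes being compared look roughly like the sketch below. The type name `product` and the fields `title`, `body` and `sku` are hypothetical, and nothing here says which variant is faster. One difference is certain, though: dis_max scores a document by its best-matching sub-query (plus tie_breaker times the others), while a query on _all scores against a single combined field, so the two will not rank identically.

```python
import json


def dis_max_query(text, fields, tie_breaker=0.0):
    """One match sub-query per field, combined with dis_max."""
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "queries": [{"match": {field: text}} for field in fields],
            }
        }
    }


def all_field_query(text):
    """A single match query against the catch-all _all field."""
    return {"query": {"match": {"_all": text}}}


# Hypothetical mapping: include_in_all decides which fields feed into _all.
mapping = {
    "product": {
        "properties": {
            "title": {"type": "string", "include_in_all": True, "boost": 2.0},
            "body": {"type": "string", "include_in_all": True},
            "sku": {"type": "string", "include_in_all": False},
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(dis_max_query("laptop", ["title", "body"], tie_breaker=0.3)))
    print(json.dumps(all_field_query("laptop")))
```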
On Wednesday, October 23, 2013 3:16:45 AM UTC-4, Jörg Prante wrote: