Okay, let's attack this directly. We have a cluster of 6 machines (6
nodes) and an index of just under 3.5 million documents. Each document
represents an Internet domain name, and we query this index to find names
that exist in it. Most queries come back in under 50ms, but a sizable
share take 600ms to 900ms and therefore show up in our slow query log. If
they ALL performed at that speed, I wouldn't be nearly as confused, but
only about 10% to 20% of the queries are "slow." That's clearly too many.
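To put a number on "10% to 20% slow" against a sub-100ms goal, here is a minimal sketch that computes the share of queries over a latency threshold and a nearest-rank percentile. The function names are hypothetical, and the sample latencies are made up for illustration. They are not measurements from this cluster.

```python
import math


def slow_fraction(latencies_ms, threshold_ms):
    """Fraction of queries whose latency exceeds threshold_ms."""
    if not latencies_ms:
        return 0.0
    slow = sum(1 for t in latencies_ms if t > threshold_ms)
    return slow / len(latencies_ms)


def percentile(latencies_ms, p):
    """Nearest-rank percentile (p in 0..100) of a list of latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: 8 fast queries and 2 slow ones (illustrative only).
sample = [22, 35, 41, 18, 30, 47, 25, 39, 707.6, 880]
print(slow_fraction(sample, 100))  # 0.2
print(percentile(sample, 50))      # 35
print(percentile(sample, 90))      # 707.6
```

A median this low alongside a p90 in the hundreds of milliseconds matches the pattern described: the typical query is fast, and the tail is what misses the target.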
Head reports that this index looks like this:
size: 424Mi (2.47Gi)
docs: 3,428,471 (3,428,471)
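The two size figures from Head can be cross-checked against the replica count. This sketch assumes the parenthesized size is the total across all shard copies (primaries plus replicas). Dividing total by primary size should then give roughly the number of copies of the index.

```python
# Sizes as reported by Head: primary store vs. total store (all copies, assumed).
PRIMARY_MIB = 424
TOTAL_GIB = 2.47


def implied_copies(primary_mib, total_gib):
    """Estimate how many full copies of the index the total size represents."""
    return total_gib * 1024 / primary_mib


copies = implied_copies(PRIMARY_MIB, TOTAL_GIB)
print(round(copies, 2))  # 5.97
print(round(copies))     # 6, i.e. 1 primary + 5 replicas
```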
Here is the configuration for a typical node (they're all pretty much the
same). We have 2 machines in a dev data center, 2 in a mesa data center
and 2 in a phx data center. Each of the two machines in a data center has
a "node.zone" tag set, and, as you can see, I have cluster routing
awareness set to take "zone" as its marching orders. The data pipes
between the data centers are beefy, and while I acknowledge that cross-DC
clusters aren't generally smiled upon, it appears to work.
Each machine has 96G of RAM, and we start ES with a 30G heap. File
descriptors are set at 64,000. Note that I've selected the memory-mapped
file system (mmapfs).
Server-specific settings for cluster domainiq-es
discovery.zen.ping.unicast.hosts: ["dev2.glbt1.gdg", "m1p1.mesa1.gdg",
"m1p4.mesa1.gdg", "p3p3.phx3.gdg", "p3p4.phx3.gdg"]
The following configuration items should be the same for all ES servers
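The RAM and heap figures above leave a large budget for the operating system. This sketch compares that budget with the on-disk size of one full copy of the index. It assumes each node holds about one copy (~424 MiB, per Head) and ignores memory used by other processes. Under those assumptions, the memory-mapped index files should fit comfortably in the OS page cache once warm.

```python
# Per-node memory budget, using the numbers from the post.
RAM_GIB = 96
HEAP_GIB = 30
PRIMARY_INDEX_MIB = 424  # one full copy of the index, per Head


def page_cache_headroom_gib(ram_gib, heap_gib):
    """RAM left for the OS (and thus the file system cache) after the JVM heap."""
    return ram_gib - heap_gib


headroom = page_cache_headroom_gib(RAM_GIB, HEAP_GIB)
index_gib = PRIMARY_INDEX_MIB / 1024
print(headroom)              # 66
print(round(index_gib, 2))   # 0.41
print(index_gib < headroom)  # True
```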
And here is a typical slow query:
[2014-07-31 07:35:31,530][WARN ][index.search.slowlog.query] [Mesa-03]
[aftermarket-2014-07-31_02-38-19] took[707.6ms], took_millis,
types[premium], stats, search_type[QUERY_THEN_FETCH], total_shards,
OR tokens:(((pet^1.2 pets^1.0 *^1.0)AND(us^1.2 *^0.8)AND(ie^1.2
*^0.6)AND(s^1.2 *^0.4)) OR((pet^1.2 pets^1.0)AND(us^1.2)AND(ie^1.2))^3.0)
AND tld:(com^1.001 OR in^0.99 OR co.in^0.941174367459617 OR
net.in^0.8848832474555992 OR us^0.85 OR org.in^0.8397882862729736 OR
gen.in^0.785829669672289 OR firm.in^0.7414549824163524 OR ind.in^0.7 OR
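To see how the slow queries are distributed without eyeballing the log, the took[] field can be extracted from each slow log line. This is a minimal sketch. It assumes the took[...] format shown in the line above and handles only the "ms" and "s" units. The function name is hypothetical.

```python
import re

# Matches the took[...] field of a slow log line, e.g. "took[707.6ms]".
# "took_millis[...]" is not matched because "took" must be followed by "[".
TOOK_RE = re.compile(r"took\[([\d.]+)(ms|s)\]")


def took_ms(line):
    """Return the query time in milliseconds, or None if no took[] field."""
    match = TOOK_RE.search(line)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return value * 1000 if unit == "s" else value


line = ("[2014-07-31 07:35:31,530][WARN ][index.search.slowlog.query] "
        "[Mesa-03] [aftermarket-2014-07-31_02-38-19] took[707.6ms],")
print(took_ms(line))  # 707.6
```

Feeding every line of the slow log through took_ms and into a percentile calculation gives the shape of the tail, not just its worst cases.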
Note that I create 5 shards with 5 replicas, so that each node holds all 5
shards at all times. I THOUGHT THIS MEANT BETTER PERFORMANCE. That is, I
thought having all 5 shards on every node meant that a query sent to a node
didn't have to ask another node for data. IS THIS NOT TRUE?
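The placement arithmetic behind "each node has all 5 shards" can be checked with a short sketch. It relies on one Elasticsearch allocation rule: a node never holds two copies of the same shard. It also assumes copies are spread evenly. The helper names are hypothetical. The sketch only checks where copies live, not which copies a given search is routed to.

```python
def shards_per_node(primaries, replicas, nodes):
    """Average shard copies per node when copies are spread evenly.

    A node never holds two copies of the same shard, so every shard
    needs (1 + replicas) distinct nodes to be fully assigned.
    """
    copies_per_shard = 1 + replicas
    if copies_per_shard > nodes:
        raise ValueError("too few nodes: some replicas would stay unassigned")
    return primaries * copies_per_shard / nodes


def every_node_has_every_shard(replicas, nodes):
    """True when each shard has exactly one copy on every node."""
    return 1 + replicas == nodes


def copies_per_zone(replicas, zones):
    """Copies of each shard per zone, assuming awareness spreads them evenly."""
    return (1 + replicas) / zones


print(shards_per_node(5, 5, 6))          # 5.0
print(every_node_has_every_shard(5, 6))  # True
print(copies_per_zone(5, 3))             # 2.0
print(shards_per_node(2, 5, 6))          # 2.0
```

With 6 copies of every shard and 6 nodes, each node must hold exactly one copy of each shard, and with three zones each zone ends up with two copies of every shard.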
Here's where it gets even more interesting: I tried setting the number of
shards to 2 (with 5 replicas), and my slow queries went to almost 2 seconds
(2000ms). This is also terribly counter-intuitive! I thought fewer shards
meant less lookup time.
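One way to frame the 2-shard result is to look at how much data each shard-level search covers. This sketch assumes documents are routed evenly across primaries. Going from 5 shards to 2 makes each shard about 2.5 times larger. It also leaves fewer per-shard searches to run side by side. Whether that explains the slowdown is an assumption, not a diagnosis.

```python
DOCS = 3_428_471  # document count reported by Head


def docs_per_shard(total_docs, primaries):
    """Approximate documents per primary shard, assuming even routing."""
    return total_docs // primaries


for shards in (5, 2):
    print(shards, docs_per_shard(DOCS, shards))
# 5 685694
# 2 1714235
```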
Clearly, I want to optimize for reads here. I don't care if indexing is
three times as slow; we need our queries to be sub-100ms.
Any help is SERIOUSLY appreciated (and if you're in the Bay Area, I'm not
above bribes of beer :-))
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/43b8bd8a-b20f-49de-a99d-825168095d6a%40googlegroups.com.