We're experiencing large variance in query times that I'm not sure how to
diagnose. Our setup is as follows:
12 nodes (hexcore hyperthreaded, 64GB memory, 2x 3TB in RAID0 config)
One index, 200 shards, 1 replica. ~20TB including replicas. ~160m docs.
32GB JVM heap
Indexing ~150 docs/s on average. Load ~1.5.
Aside from the index and bulk threadpools (set to core counts, blocking),
everything else is default.
Docs are ~as follows:
{
  "site": long,
  "countries": long,
  "text": {"standard": "string", "en": "string", "ru": "string", ... for
           all available analyzers; only the detected doc language is indexed},
  "publication_date": date,
  ... other longs and non-analyzed terms
}
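The text field is just an object with one sub-field per analyzer; the
mapping is roughly along these lines (a sketch only -- index/type names are
placeholders and just three of the sub-fields are shown):

# sketch of the per-language text mapping; "myindex"/"doc" are placeholders
curl -XPUT 'localhost:9200/myindex/doc/_mapping' -d '{
  "doc": {
    "properties": {
      "site":             {"type": "long"},
      "countries":        {"type": "long"},
      "publication_date": {"type": "date"},
      "text": {
        "properties": {
          "standard": {"type": "string", "analyzer": "standard"},
          "en":       {"type": "string", "analyzer": "english"},
          "ru":       {"type": "string", "analyzer": "russian"}
        }
      }
    }
  }
}'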
Queries ~are:
{
  "query_string": {"query": "...", "fields": ["text.standard", ...]},
  "facets": {
    "site": term facet,
    "countries": term facet,
    "publication_date": histogram
  },
  "range_filter": on publication_date,
  "term_filter": on sites and countries
}
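Concretely, a request ends up looking something like this (a sketch --
index name, query text, filter values and the exact facet/filter placement
are illustrative):

# sketch of a typical request; "myindex" and all values are placeholders
curl -XPOST 'localhost:9200/myindex/_search' -d '{
  "query": {
    "filtered": {
      "query": {
        "query_string": {"query": "some search terms",
                         "fields": ["text.standard", "text.en"]}
      },
      "filter": {
        "and": [
          {"range": {"publication_date": {"from": "2013-01-01", "to": "2013-02-01"}}},
          {"terms": {"site": [1, 2, 3]}},
          {"terms": {"countries": [10, 20]}}
        ]
      }
    }
  },
  "facets": {
    "site":             {"terms": {"field": "site"}},
    "countries":        {"terms": {"field": "countries"}},
    "publication_date": {"date_histogram": {"field": "publication_date", "interval": "day"}}
  }
}'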
Currently queries take about 10-15 seconds, but often hit 75s (the nginx
timeout). We've had issues with failed merges resulting in shards with huge
segment counts. I use Lucene's CheckIndex to "-fix" these issues. Out of
curiosity I then ran _optimize down to 1 segment to see how much this
affected performance. Search times dropped to around the 1-5s mark. Great,
but the process caused huge load for about 6-8 hours. To try to keep
segment counts low I then set optimize to run daily with a max segment
count of 3. Again there was a lot of instability.
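For reference, the daily optimize, and the sort of merge throttling
settings I mean below, look roughly like this (a sketch -- index name and
the rate are placeholders, not tested values):

# the daily optimize that caused the instability
curl -XPOST 'localhost:9200/myindex/_optimize?max_num_segments=3'

# store-level merge throttling (values are placeholders, not recommendations)
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "20mb"
  }
}'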
I've kept the details brief, assuming the gist would be enough to make any
obvious wtf moments stand out. Really what I want to know, in no particular
order, is:
(0. What sounds ridiculous in the above.)
- Is it possible to get a breakdown of query execution (ie. it took this
long executing on shard x, which was merging at the time)? The sketch after
this list shows the kind of thing I mean.
- What's a good strategy for keeping the segment count down:
  - Without killing the cluster. There are a lot of settings to throttle
merges that sound applicable (e.g. the store-level throttling sketched
above); my concern is that just about any merge is enough to cause massive
query times.
  - Does this sound like something I should concentrate on? Perhaps more
frequent merges will cause shorter freezes.
  - Is optimize something you should expect to have to run, or is there
something wrong with the setup?
- Does the shard count sound "out there" for the doc count/size/etc?
- How do I optimize the heap/file system cache balance (when should I
allocate more to the JVM vs. the file system cache), and does it sound like
this would help?
- How do other people go about profiling these types of issues?
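To make the breakdown/profiling questions concrete: what I'm after is
something like combining the per-index search slow log, hot_threads and the
segments API (sketched below with a placeholder index name and thresholds),
but ideally with per-shard timing for a single query:

# per-index search slow log thresholds -- index name and thresholds are placeholders
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}'

# what each node is busy doing (e.g. do merges coincide with the slow queries?)
curl 'localhost:9200/_nodes/hot_threads'

# per-shard segment counts, to spot shards that are fragmenting
curl 'localhost:9200/myindex/_segments'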
Details on request/interest.