Large variance in query times

We're experiencing large variance in query times that I'm not sure how to
diagnose. Our setup is as follows:

12 nodes (hexcore hyperthreaded, 64GB memory, 2x 3TB in RAID0 config)
One index, 200 shards, 1 replica. ~20TB including replicas. ~160m docs.
32GB JVM heap
Indexing ~150 docs/s on average. Load ~1.5. Documents are a break

Aside from the index and bulk threadpools (both set to the core count and the
blocking type, roughly as sketched below), everything else is default.
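Concretely, the non-default bits of elasticsearch.yml look roughly like this
(the size of 12 is just a stand-in for the logical core count):

threadpool:
  index:
    type: blocking
    size: 12
  bulk:
    type: blocking
    size: 12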

Docs are ~as follows:

{
  "site": long,
  "countries": long,
  "text": {"standard": "string", "en": "string", "ru": "string", ... one
           sub-field per available analyzer; only the sub-field for the
           detected document language is indexed},
  "publication_date": date,
  ... other longs and non-analyzed terms
}
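For reference, the mapping behind that is along these lines (the analyzer
names and the exact set of language sub-fields here are illustrative rather
than a copy of our real mapping):

{
  "properties": {
    "site":      {"type": "long"},
    "countries": {"type": "long"},
    "text": {
      "properties": {
        "standard": {"type": "string", "analyzer": "standard"},
        "en":       {"type": "string", "analyzer": "english"},
        "ru":       {"type": "string", "analyzer": "russian"}
      }
    },
    "publication_date": {"type": "date"}
  }
}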

Queries ~are:

{
  "query_string": {"query": "...", "fields": ["text.standard", ...]},
  "facets": {
    "site": term facet,
    "countries": term facet,
    "publication_date": histogram
  },
  "range_filter": on publication_date,
  "term_filter": on sites and countries
}
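Spelled out a little more concretely, a typical request looks something like
the following. The dates, IDs and the histogram interval are placeholders,
and I've written the histogram as a date_histogram facet since it's on
publication_date:

{
  "query": {
    "query_string": {"query": "...", "fields": ["text.standard", "text.en"]}
  },
  "filter": {
    "and": [
      {"range": {"publication_date": {"from": "2013-05-01", "to": "2013-06-01"}}},
      {"terms": {"site": [1, 2, 3]}}
    ]
  },
  "facets": {
    "site":             {"terms": {"field": "site"}},
    "countries":        {"terms": {"field": "countries"}},
    "publication_date": {"date_histogram": {"field": "publication_date", "interval": "day"}}
  }
}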

Currently queries take about 10-15 seconds, but often hit 75s (the nginx
timeout). We've had issues with failed merges resulting in shards with huge
segment counts. I use Lucene's CheckIndex to "-fix" these issues. I then ran
_optimize down to 1 segment out of curiosity, to see how much this affected
performance. Search times dropped to around the 1-5s mark. Great, but the
process caused huge load for about 6-8 hours. To try to keep segment counts
low I set optimize to run daily with a target of 3 segments. Again there was
a lot of instability.
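For reference, the daily run is essentially just the _optimize call below.
The store throttling settings are the kind of thing I keep seeing suggested
but haven't actually tried, and depending on the version they may need to go
in elasticsearch.yml rather than through the cluster settings API; the index
name and throttle value are placeholders:

# cap merge I/O cluster-wide before kicking off the optimize (untried here)
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "20mb"
  }
}'

# the nightly optimize itself, down to at most 3 segments per shard
curl -XPOST 'localhost:9200/myindex/_optimize?max_num_segments=3'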

I've kept details brief, assuming that the gist above would highlight any
obvious WTF moments. Really, what I want to know, in no particular order, is:

(0. What sounds ridiculous in the above?)

  1. Is it possible to get a breakdown of query execution (i.e. "took this
    long executing on shard x, which was merging at the time")? See the
    sketch after this list for the kind of thing I've been poking at.
  2. What's a good strategy for keeping segment counts down:
    • Without killing the cluster. There are a lot of settings for
      throttling merges that sound applicable; my concern is that any merge
      at all is enough to cause massive query times.
    • Does this sound like something I should concentrate on?
    • Perhaps more frequent merges would cause shorter freezes?
    • Is optimize something you should expect to have to run, or is there
      something wrong with the setup?
  3. Does the shard count sound "out there" for the doc count/size/etc.?
  4. How do I optimize the heap / filesystem cache balance (when do I
    allocate more to the JVM vs. the filesystem cache), and does it sound
    like this would help?
  5. How do other people go about profiling these types of issues?
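For 1 and 5, the closest I've got so far is grabbing hot threads and
per-index stats while a slow query is in flight. The index name below is a
placeholder, and I'm not certain every endpoint here exists on our version:

# what each node is busy doing while a slow query is running
curl 'localhost:9200/_nodes/hot_threads'

# search, merge and refresh stats (watch current merges vs. query time)
curl 'localhost:9200/myindex/_stats?search=true&merge=true&refresh=true'

# per-shard segment counts, if the segments API is available
curl 'localhost:9200/myindex/_segments'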

Details on request/interest.

A few omissions from the previous post:

  • Running Ubuntu 12.04 and OpenJDK (IcedTea 2.3.9).
  • I'm using a refresh interval of 3600s, although refresh is called
    explicitly when required, roughly every 10 minutes (sketched below).
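Concretely, that amounts to something like this (index name is a placeholder):

# index setting, applied once
curl -XPUT 'localhost:9200/myindex/_settings' -d '{"index": {"refresh_interval": "3600s"}}'

# explicit refresh, triggered from the indexing pipeline roughly every 10 minutes
curl -XPOST 'localhost:9200/myindex/_refresh'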

One more omission: running ES 0.20.4.

OpenJDK (IcedTea) != Java. Any time spent running ES on OpenJDK is time
wasted.

Install and use real Oracle (aka Sun) Java. It works fine on Ubuntu too;
that's how I run ES on all my systems: Mac, Ubuntu, and Solaris x86-64.
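For example, on Ubuntu 12.04 one common route is the WebUpd8 PPA installer;
the package and PPA names below are from memory, so adjust for whichever Java
version you're targeting:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

# confirm the default JVM is now Oracle's before restarting ES
java -version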

Brian

Thanks Brian, will give it a try.
