We're experiencing large variance in query times that I'm not sure how to
diagnose. Our setup is as follows:
12 nodes (hexcore hyperthreaded, 64GB memory, 2x 3TB in RAID0 config)
One index, 200 shards, 1 replica. ~20TB including replicas. ~160m docs.
32GB JVM heap
Indexing ~150 docs/s on average. Load ~1.5.
Aside from the index and bulk threadpools (set to core counts, blocking),
everything else is default.
Docs are ~as follows:
{
  "site": long,
  "countries": long,
  "text": {"standard": "string", "en": "string", "ru": "string", ... for
           all available analyzers; only the detected doc language is indexed},
  "publication_date": date,
  ... other longs and non-analyzed terms
}
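The text field is just an object with one sub-field per analyzer; the
mapping is roughly along these lines (a sketch only -- index/type names are
placeholders and just three of the sub-fields are shown):

# sketch of the per-language text mapping; "myindex"/"doc" are placeholders
curl -XPUT 'localhost:9200/myindex/doc/_mapping' -d '{
  "doc": {
    "properties": {
      "site":             {"type": "long"},
      "countries":        {"type": "long"},
      "publication_date": {"type": "date"},
      "text": {
        "properties": {
          "standard": {"type": "string", "analyzer": "standard"},
          "en":       {"type": "string", "analyzer": "english"},
          "ru":       {"type": "string", "analyzer": "russian"}
        }
      }
    }
  }
}'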
Queries ~are:
{
  "query_string": {"query": "...", "fields": ["text.standard", ...]},
  "facets": {
    "site": term facet,
    "countries": term facet,
    "publication_date": histogram
  },
  "range_filter": on publication_date,
  "term_filter": on sites and countries
}
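Concretely, a request ends up looking something like this (a sketch --
index name, query text, filter values and the exact facet/filter placement
are illustrative):

# sketch of a typical request; "myindex" and all values are placeholders
curl -XPOST 'localhost:9200/myindex/_search' -d '{
  "query": {
    "filtered": {
      "query": {
        "query_string": {"query": "some search terms",
                         "fields": ["text.standard", "text.en"]}
      },
      "filter": {
        "and": [
          {"range": {"publication_date": {"from": "2013-01-01", "to": "2013-02-01"}}},
          {"terms": {"site": [1, 2, 3]}},
          {"terms": {"countries": [10, 20]}}
        ]
      }
    }
  },
  "facets": {
    "site":             {"terms": {"field": "site"}},
    "countries":        {"terms": {"field": "countries"}},
    "publication_date": {"date_histogram": {"field": "publication_date", "interval": "day"}}
  }
}'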
Currently queries take about 10-15 seconds, but often hit 75s (the nginx
timeout). We've had issues with failed merges resulting in shards with huge
segment counts. I use Lucene's CheckIndex to "-fix" these issues. Out of
curiosity I then ran _optimize down to 1 segment to see how much this
affected performance. Search times dropped to around the 1-5s mark. Great,
but the process caused huge load for about 6-8 hours. To try to keep
segment counts low I then set optimize to run daily with a max segment
count of 3. Again there was a lot of instability.
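For reference, the daily optimize, and the sort of merge throttling
settings I mean below, look roughly like this (a sketch -- index name and
the rate are placeholders, not tested values):

# the daily optimize that caused the instability
curl -XPOST 'localhost:9200/myindex/_optimize?max_num_segments=3'

# store-level merge throttling (values are placeholders, not recommendations)
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.store.throttle.type": "merge",
    "indices.store.throttle.max_bytes_per_sec": "20mb"
  }
}'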
I've kept the details brief, assuming the gist would be enough to make any
obvious wtf moments stand out. Really what I want to know, in no particular
order, is:
(0. What sounds ridiculous in the above.)
- Is it possible to get a breakdown of query execution (ie. it took this
long executing on shard x, which was merging at the time)? The sketch after
this list shows the kind of thing I mean.
- What's a good strategy for keeping the segment count down:
  - Without killing the cluster. There are a lot of settings to throttle
merges that sound applicable (e.g. the store-level throttling sketched
above); my concern is that just about any merge is enough to cause massive
query times.
  - Does this sound like something I should concentrate on? Perhaps more
frequent merges will cause shorter freezes.
  - Is optimize something you should expect to have to run, or is there
something wrong with the setup?
- Does the shard count sound "out there" for the doc count/size/etc?
- How do I optimize the heap/file system cache balance (when should I
allocate more to the JVM vs. the file system cache), and does it sound like
this would help?
- How do other people go about profiling these types of issues?
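To make the breakdown/profiling questions concrete: what I'm after is
something like combining the per-index search slow log, hot_threads and the
segments API (sketched below with a placeholder index name and thresholds),
but ideally with per-shard timing for a single query:

# per-index search slow log thresholds -- index name and thresholds are placeholders
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}'

# what each node is busy doing (e.g. do merges coincide with the slow queries?)
curl 'localhost:9200/_nodes/hot_threads'

# per-shard segment counts, to spot shards that are fragmenting
curl 'localhost:9200/myindex/_segments'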
Details on request/interest.