Significant performance deterioration as of 0.90.x


(Oli McCormack) #1

Hi folks,

We've recently tried to upgrade to 0.90.5 and have noticed a huge drop in
geo_shape query performance. As well as poor query latency, the volume of
resources required to answer simple geospatial operations has a huge
knock-on impact on other query types.

For our examples, we're executing a point lookup against a set of <100
documents representing geometries in the US. The documents exist in an
index with many other docs (~100mn), but are defined as a specific type in
the index, with an appropriate mapping and tree level. For these examples,
all metrics were taken from executing 50 queries per sample, ~4
concurrently. I ran two clusters (one for 0.20.4 and 0.90.5), both clusters
have the same number of nodes, and the 0.90.5 cluster has about 2/3 the
number of docs. Latency is measured using the "took" metric in the ES JSON
response.
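For reference, percentiles over a set of "took" values can be computed with a
short script along these lines (a minimal sketch, not our actual harness; the
function names and the example URL in the comment are placeholders):

```python
import json
import math
import urllib.request

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ranked = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ranked))
    return ranked[max(rank, 1) - 1]

def took_ms(es_url, query_body):
    """Run one search and return Elasticsearch's server-side 'took' latency in ms."""
    req = urllib.request.Request(
        es_url,  # e.g. "http://localhost:9200/someindex/sometype/_search" (placeholder)
        data=json.dumps(query_body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["took"]

# With the collected samples, report the 50th and 90th percentiles:
samples = [95, 97, 98, 99, 101, 102, 105, 110, 140, 146]  # example 'took' values
print(percentile(samples, 50), percentile(samples, 90))
```

Note "took" measures server-side query time only, so client and network
overhead is excluded from the figures below.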

Basic perf for searching for a point in a doc (query sample below):

sample     version   50-percentile (ms)   90-percentile (ms)
sample 1   0.20.4    98.5                 145.9
sample 2   0.20.4    90.5                 183.2
sample 1   0.90.5    3685.5               6211.9
sample 2   0.90.5    3588                 5366.3

Aside from the slower performance here, the impact on other ongoing queries
is significant. Whilst these queries are running, this is the latency we
see for very simple term lookups:

query                             version   50-percentile (ms)   90-percentile (ms)
term query                        0.20.4    27                   65
term with ongoing point queries   0.20.4    34                   116.5
term query                        0.90.5    52.5                 105.8
term with ongoing point queries   0.90.5    1745.5               3811.8

During testing, I also tried a few things on a single box. Here are some
things I observed:

  • For point queries, CPU is entirely pegged to achieve results.
  • On a bigger box with twice the CPUs, latency dropped to about 60% of
    what we see here.
  • No obvious memory constraints.
  • When I executed 10 point queries, 1456MB was loaded into disk cache,
    compared to 8MB when querying a separate small index that held only the
    state documents.
  • I ran some tests loading to a single shard; there seems to be a point
    where performance dropped sharply. Specifically:
    • between 250k and 400k docs, latency increased roughly 10x (see
      included graph at end of message)
    • before this point, perf was actually quite reasonable on the small
      index

Notes on the configuration:

  • 7 nodes, m1.large
  • 1 replica
  • 1 index, ~100mn docs
  • geo mapping: { "type" : "geo_shape", "tree" : "quadtree",
    "tree_levels" : 9, "distance_error_pct" : 0.0 }
  • query sample: {"constant_score": {"boost": 1, "filter": {"geo_shape":
    {"geometry": {"shape": {"type": "Point", "coordinates": [, ]},
    "relation": "intersects"}}}}}
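
For clarity, the filter above (with its missing comma restored) can be built
like this — a sketch only; the lon/lat values are illustrative placeholders,
not taken from our data:

```python
import json

def point_intersects_filter(lon, lat):
    """Build the constant_score geo_shape point-intersection query shown above.
    'geometry' is the mapped geo_shape field; lon/lat are illustrative only."""
    return {
        "constant_score": {
            "boost": 1,
            "filter": {
                "geo_shape": {
                    "geometry": {
                        "shape": {"type": "Point", "coordinates": [lon, lat]},
                        "relation": "intersects",
                    }
                }
            },
        }
    }

# GeoJSON coordinate order is [longitude, latitude]:
print(json.dumps(point_intersects_filter(-104.99, 39.74)))
```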

I'm extremely confused about why we're seeing this performance difference,
especially after a version upgrade and a reduction in index size, and it's
blocking our migration. We noticed none of these issues with our previous
cluster, and have completed an ingest of the exact same mappings &
documents - except fewer of them - to the new cluster.

I would be very interested to hear about any solutions to or reasons for
this problem, and am more than happy to investigate further angles if
people have suggestions.

Cheers,
Oli

[image: Inline image 1]

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(chilling) #2

Hi Oli,

sorry for taking so long. I'm currently looking at this issue and trying to
find out which of our changes caused it. I'll keep you up to date.

Thanks for pointing this out,
Florian

On Tuesday, November 26, 2013 10:52:51 AM UTC+9, Oli wrote:



(Oli McCormack) #3

Hey Florian,

Thanks a lot, I appreciate you investigating this. We're still blocked by
the issue (holding off on moving from 0.20.4), but I'm very happy to help
with debugging if there are any suggestions you have.

Oli

On Wed, Dec 11, 2013 at 12:37 AM, Florian Schilling <florian.schilling@elasticsearch.com> wrote:



(system) #4