Geo_distance performance problem fixed by merging segments

Hi all,

I am currently seeing a problem where query performance drops dramatically (from approximately 20ms to 300ms) when I run a geo_distance search with a distance in the 250-430 mile range. Any distance larger or smaller than that range is fine (sub-100ms); only that particular range is problematic.
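
For reference, here is a minimal sketch of the kind of query involved; the index name, field name, and coordinates below are illustrative placeholders, not our exact request:

    curl -XPOST 'localhost:9200/listings/_search' -d '{
      "query": {
        "bool": {
          "filter": {
            "geo_distance": {
              "distance": "300mi",
              "location": { "lat": 40.7, "lon": -74.0 }
            }
          }
        }
      }
    }'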

I have been able to make the problem go away by using force merge to reduce the total segment count to about 5 per shard. However, the periodic writes to the index (every 30 minutes) cause the segment count to drift back up to around 25 per shard, at which point the mid-range geo_distance searches start performing poorly again.
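
The force merge is just the standard API call, and I check the per-shard segment count with _cat/segments (index name illustrative):

    # reduce to ~5 segments per shard
    curl -XPOST 'localhost:9200/listings/_forcemerge?max_num_segments=5'

    # verify per-shard segment counts afterwards
    curl 'localhost:9200/_cat/segments/listings?v'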

The problem is definitely caused by the geo_distance query clause:

  • Adjusting the distance out of the problem range, or removing the clause altogether, resolves the performance problem. The problem recurs when the distance is moved back into the problem range.
  • Profiling shows that most of the time is spent in GeoPointTermQueryConstantScoreWrapper (profiling approach sketched below). The timing is consistently slow across all nodes (i.e., it is not a single problem node).
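
For anyone who wants to reproduce the profiling, I believe the experimental Profile API available in 2.x is the relevant tool; a sketch using the same illustrative names as above:

    curl -XPOST 'localhost:9200/listings/_search' -d '{
      "profile": true,
      "query": {
        "bool": {
          "filter": {
            "geo_distance": {
              "distance": "300mi",
              "location": { "lat": 40.7, "lon": -74.0 }
            }
          }
        }
      }
    }'

The response then includes a per-shard timing breakdown by query type, which is where GeoPointTermQueryConstantScoreWrapper shows up for us.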

I have been able to reproduce this on multiple clusters (dev/test/production/local workstation).

Relevant info:

  • We are running Elasticsearch 2.3.2 with 3 master, 3 query, and 5 data nodes; the index has 5 primary shards with 1 replica (10 shards total).
    • This is also reproducible on a cluster of just 5 undifferentiated nodes (same 5 shards with 1 replica).
  • The index contains 7M records and is written to periodically (every 30 minutes or so). About half the records will change over the course of a day, some batches larger than others.
  • We do two types of updates:
    • Full record upsert (several _bulk operations) + delete of older documents (using delete-by-query)
    • Update date fields on the record (_bulk operations).
      NOTE: each record is touched at most once every 4 hours (usually once per day)
  • The geo_point field we query is mapped with default settings (just { "type": "geo_point" }); see the sketch after this list.
  • The index uses default settings except for index.max_result_window, which is set to 50000.
  • On my local workstation I tested with 2.3.5 and with the default index.max_result_window (10000), so I do not believe this setting is the cause of the problem.
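
For completeness, a minimal sketch of the relevant mapping and setting (index name, type name, and field name are illustrative; this is not our full mapping):

    curl -XPUT 'localhost:9200/listings' -d '{
      "settings": {
        "index.max_result_window": 50000
      },
      "mappings": {
        "listing": {
          "properties": {
            "location": { "type": "geo_point" }
          }
        }
      }
    }'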

My questions are:

  1. Is there a way to tune Elasticsearch so that queries do not perform so poorly for distances in this range?
  2. Is there a way to make the cluster perform automatic merges more frequently (see the merge-policy sketch after this list), or should I force merge?
  3. Running frequent force merges on an index seems to be considered ill-advised. Is that still the case when the index receives bulk writes every 30 minutes?
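
Regarding question 2, my best guess so far is that the tiered merge policy settings are the relevant knobs (e.g. lowering segments_per_tier so fewer segments are kept around, at the cost of more merge I/O). I am not certain these are the right settings to touch, or whether they can be changed on a live 2.3 index rather than only at index creation, so treat this as a sketch of what I mean rather than something I have validated:

    # illustrative values only; defaults are 10 for both settings
    curl -XPUT 'localhost:9200/listings/_settings' -d '{
      "index.merge.policy.segments_per_tier": 5,
      "index.merge.policy.max_merge_at_once": 5
    }'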

Thank you for your assistance!

# Chris

Hi all,

I have done some further investigation, and the problem appears to be related to issue #18874, which is resolved in 5.0. I have been able to confirm (at least on my workstation) that 5.0 resolves the problem we were seeing on indexes created in 2.3 or 2.4.

Our focus now is on determining whether running a periodic force merge every 3-4 hours as a short-term measure will mitigate the high CPU usage we see or exacerbate it. In any event, we are planning to upgrade to 5.0, either immediately (if the force merges do not help or make things worse) or within the next 2-3 months (if they do).
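
If we do go the interim route, the plan is nothing fancier than a scheduled force merge, along the lines of the cron entry below (index name, schedule, and user are illustrative):

    # /etc/cron.d/es-forcemerge -- merge down to ~5 segments per shard every 4 hours
    0 */4 * * * esuser curl -s -XPOST 'localhost:9200/listings/_forcemerge?max_num_segments=5' > /dev/null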

Thanks for a great product!

# Chris