Elasticsearch Random Node High Load


(djovic) #1

Issue previously opened here:
https://github.com/elasticsearch/elasticsearch/issues/3208

Re-posted on the mailing list, as requested:

We have a problem with our Elasticsearch cluster in every environment. The
cluster sometimes gets into an unstable state in which certain ES nodes
carry a load several times greater than the load on the other nodes.

We can reproduce this reliably and quickly by hitting the cluster with
about 40 concurrent threads while data indexing is running, and much more
slowly by simply running constant searches.
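
For anyone trying to reproduce this, here is a minimal sketch of a 40-thread load driver. It is not the harness used above, only an illustration under assumptions: the host name `client-node`, the `match_all` query, and the function names `http_search` and `run_load` are all placeholders, and the request count per thread is arbitrary.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_search(url="http://client-node:9200/product/_search"):
    """Send one match_all search. The host name and index are placeholders."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        response.read()


def run_load(search_fn, threads=40, requests_per_thread=100):
    """Call search_fn from `threads` workers; return every call's latency in seconds."""

    def worker(_worker_id):
        latencies = []
        for _ in range(requests_per_thread):
            start = time.perf_counter()
            search_fn()
            latencies.append(time.perf_counter() - start)
        return latencies

    # One pool thread per simulated client, so all workers run concurrently.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_thread = list(pool.map(worker, range(threads)))
    return [latency for batch in per_thread for latency in batch]
```

Calling `run_load(http_search)` while an indexing job is running would approximate the scenario described; passing a different `search_fn` lets you swap in your real queries.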

Our cluster setup:

  • 4 no-data client nodes that service search requests.
  • 1 no-data client node that sends data indexing requests.
  • 10 data nodes that the 5 client nodes call for indexing and searching.

We have verified that all the boxes have:

  • the same hardware
      • CPU Intel Xeon X5550 (2.67GHz, 64-bit, 16 core)
      • 96GB high-end physical RAM (64GB JVM heap, 20GB mem baseline)
  • the same OS
      • Red Hat Enterprise Linux Server release 5.9 (Tikanga)
      • Linux version 2.6.18-348.1.1.el5 x86_64
  • the same JVM
      • Sun/Oracle 1.7.0_09 (64-bit)
  • the same ES version
      • Currently on 0.20.4
      • Problem existed since 0.19.8, when we started with ES
  • no other software running
  • the same number of shards
      • 1 product shard per node
          • 13M product documents (product schema has thousands of fields)
      • 1 entity shard per node
          • 135M entity documents (dozens of entity data types, each with several fields)
  • approximately the same amount of data
      • product index is 25GB on disk
      • entity index is 12GB on disk
      • All data loads into about 20GB baseline RAM
  • the exact same logical configuration
    {
      product: {
        settings: {
          index.translog.flush_threshold_size: 500mb
          index.refresh_interval: 30s
          index.number_of_replicas: 1
          index.translog.disable_flush: false
          index.version.created: 190999
          index.number_of_shards: 5
          index.routing.allocation.total_shards_per_node: 1
          index.translog.flush_threshold_period: 60m
          index.translog.flush_threshold_ops: 5000
        }
      }
      entity: {
        settings: {
          index.translog.flush_threshold_size: 500mb
          index.refresh_interval: 30s
          index.number_of_replicas: 1
          index.translog.disable_flush: false
          index.version.created: 190999
          index.number_of_shards: 5
          index.routing.allocation.total_shards_per_node: 1
          index.translog.flush_threshold_period: 60m
          index.translog.flush_threshold_ops: 5000
        }
      }
    }
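
As a sanity check on how these settings line up with "1 shard per node" across 10 data nodes, the arithmetic can be sketched as below. The function names are made up for illustration; the three values come from the settings shown above.

```python
def shard_copies(number_of_shards, number_of_replicas):
    """Total shard copies for one index: each primary plus its replicas."""
    return number_of_shards * (1 + number_of_replicas)


def min_nodes_needed(number_of_shards, number_of_replicas, total_shards_per_node):
    """Fewest data nodes that can hold every copy under the per-node cap."""
    copies = shard_copies(number_of_shards, number_of_replicas)
    return -(-copies // total_shards_per_node)  # ceiling division


# Values taken from the product and entity index settings above.
copies = shard_copies(number_of_shards=5, number_of_replicas=1)
nodes = min_nodes_needed(5, 1, total_shards_per_node=1)
print(copies, nodes)  # prints "10 10"
```

So each index has 10 shard copies (5 primaries and 5 replicas), and with `total_shards_per_node: 1` they need exactly the 10 data nodes, one copy of each index per node.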

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Otis Gospodnetić) #2

Hi,

I took a look at the ZIP, but there is only a load graph there. Are you
graphing all the other ES metrics? If so, look at them and see which ones
differ between nodes. If not, see my signature. If all the hardware and
other specs are the same, something is causing that load, and the load is
just a symptom. We had a client the other day whose load would go up every
few hours. The query rate was mostly constant. So why would the load (and
query latency) go up? It turned out the proportion of the more expensive
queries was higher during those high load/latency times. But before that
we looked at all the metrics and eliminated everything else we could
eliminate. :)
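
One way to do the "see which ones are different" step is sketched below. It assumes you have already collected one snapshot of numeric metrics per node into a dictionary (how you collect them is up to your monitoring setup). The function name `divergent_metrics`, the 2x threshold, and the sample metric names are assumptions for illustration only.

```python
from statistics import median


def divergent_metrics(node_metrics, ratio=2.0):
    """Flag (node, metric, value, median) where value exceeds ratio * cluster median.

    node_metrics maps node name -> {metric name: numeric value}.
    """
    findings = []
    metric_names = sorted({name for metrics in node_metrics.values() for name in metrics})
    for name in metric_names:
        values = {
            node: metrics[name]
            for node, metrics in node_metrics.items()
            if name in metrics
        }
        mid = median(values.values())
        if mid <= 0:
            continue  # a ratio against a zero or negative median is meaningless
        for node, value in sorted(values.items()):
            if value > ratio * mid:
                findings.append((node, name, value, mid))
    return findings


# Made-up sample numbers, only to show the input shape.
sample = {
    "data1": {"load": 2.0, "gc_ms": 100},
    "data2": {"load": 2.0, "gc_ms": 110},
    "data3": {"load": 2.0, "gc_ms": 95},
    "data4": {"load": 9.0, "gc_ms": 105},
}
findings = divergent_metrics(sample)
```

With the sample above only `data4`'s `load` is flagged, which is the symptom. The useful output is any other metric that gets flagged on the same node, since that points at a possible cause.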

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

On Thursday, June 20, 2013 7:34:50 PM UTC-4, djovic wrote:

Issue previously opened here:
https://github.com/elasticsearch/elasticsearch/issues/3208


(djovic) #3

This problem was resolved with the upgrade to 0.90.x.

DJ


