Increase in CPU usage and indexing time after upgrade from 0.19.2 -> 0.19.10

Hi All,

We've seen a jump in CPU usage after upgrading the cluster from 0.19.2 to
0.19.10.

Background info:

  • 3 node cluster
  • 5 shards, 1 replica
  • ~33 million documents, ~32GB in size
  • ~110k/80k new/deleted documents/day
  • ~300-500k searches/day

Each node is allocated 4GB of JVM heap.

We upgraded the cluster Sunday night. Full shutdown and restart of all
nodes.

Our busy days are weekdays, and for high-level comparison purposes the
traffic is the same. Comparing our metrics archive from last Friday (0.19.2)
with Monday (0.19.10), we see the following:

Ratio Threshold: > 2.00 or < 0.500

20130405.0.bz2   20130408.0   Ratio   Metric-Instance
      7927         293607     37.04   elasticsearch.nodes.indices.merges.current_size_in_bytes ["es3"]
      0.453          11.99    26.48   elasticsearch.nodes.indices.merges.current_docs ["es3"]
      0.001           0.019   19.00   disk.dev.read ["xvda"]
      0.019           0.233   12.26   disk.dev.read_bytes ["xvda"]
      0.178           1.004    5.64   elasticsearch.nodes.process.cpu.user_in_millis ["es2"]
     20.79          106.9      5.14   elasticsearch.nodes.process.cpu.percent ["es2"]
      0.001           0.005    5.00   elasticsearch.nodes.indices.indexing.index_time_in_millis ["es2"]
      0.013           0.062    4.77   network.tcp.retranssegs
      0.530           1.646    3.11   kernel.all.load ["15 minute"]
      0.001           0.003    3.00   elasticsearch.nodes.indices.indexing.index_time_in_millis ["es1"]
      0.001           0.003    3.00   elasticsearch.nodes.indices.indexing.index_time_in_millis ["es3"]
      0.559           1.669    2.99   kernel.all.load ["5 minute"]
      0.081           0.242    2.99   kernel.percpu.cpu.user ["cpu1"]
      0.166           0.495    2.98   elasticsearch.nodes.process.cpu.user_in_millis ["es3"]
      0.567           1.678    2.96   kernel.all.load ["1 minute"]
     18.22           53.04     2.91   elasticsearch.nodes.process.cpu.percent ["es3"]
      0.177           0.507    2.86   kernel.all.cpu.user
      0.096           0.265    2.76   kernel.percpu.cpu.user ["cpu0"]
      0.466           1.229    2.64   mem.vmstat.pgdeactivate
      0.004           0.010    2.50   elasticsearch.nodes.jvm.gc.collectors.ParNew.collection_time_in_millis ["es2"]
      0.004           0.010    2.50   elasticsearch.nodes.jvm.gc.collection_time_in_millis ["es2"]
      0.002           0.005    2.50   mem.vmstat.pgmajfault
      0.026           0.058    2.23   elasticsearch.nodes.jvm.gc.collectors.ParNew.collection_time_in_millis ["es3"]
      0.026           0.058    2.23   elasticsearch.nodes.jvm.gc.collection_time_in_millis ["es3"]
      0.727           1.533    2.11   elasticsearch.nodes.jvm.gc.collectors.ParNew.collection_count ["es3"]
      0.727           1.533    2.11   elasticsearch.nodes.jvm.gc.collection_count ["es3"]

The first column is Friday's value, the second is Monday's, the third is the
ratio (Monday / Friday), and the fourth is the metric.

So we are seeing a 3-5x increase in indexing time and a corresponding
increase in CPU time, coupled with a huge jump in merge activity (the top
two rows).
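
For reference, the elasticsearch.* counters above come from the node stats
API, so they can be cross-checked directly against the cluster. Below is a
minimal sketch (the host name is a placeholder, and depending on the exact
0.19.x release some sections such as jvm may need extra query flags):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch: dump per-node stats from the cluster over HTTP.
// "es1" is a placeholder host; any node in the cluster will do.
// 0.19.x answers at /_cluster/nodes/stats; later releases also expose /_nodes/stats.
public class DumpNodeStats {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://es1:9200/_cluster/nodes/stats");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line).append('\n');
        }
        in.close();
        conn.disconnect();

        // The JSON contains, per node, the counters compared above, e.g.
        // indices.merges.current_size_in_bytes, indices.merges.current_docs,
        // indices.indexing.index_time_in_millis and the jvm.gc collection stats
        // (the jvm section may need to be requested explicitly on some releases).
        System.out.println(body);
    }
}
```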

We are wondering whether the segment merge policy changed between 0.19.2 and
0.19.10. I can see that 0.19.3 included a Lucene version upgrade, but I can't
find much other information about the change.
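
To make the question concrete, the kind of knob being asked about is the
tiered merge policy settings (index.merge.policy.*). A tentative sketch of
relaxing one of them via the index settings API is below; "myindex" and the
host are placeholders, and whether these particular settings can be changed
on a live 0.19.x index (rather than in elasticsearch.yml or at index
creation) would need to be verified against the docs for that release:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Tentative sketch: make the tiered merge policy less aggressive by allowing
// more segments per tier (less merging, at the cost of more segments to search).
// "myindex" and "es1" are placeholders; confirm against the 0.19.x docs whether
// index.merge.policy.* is accepted by the live settings-update API.
public class RelaxMergePolicy {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://es1:9200/myindex/_settings");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);

        String body = "{ \"index.merge.policy.segments_per_tier\": 20 }";
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();

        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```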

Any ideas about possible causes/solutions?

Cheers,

Owen Butler


Hi Owen

> We've seen a jump in CPU usage after upgrading the cluster from 0.19.2
> to 0.19.10.

I can't say much about what may be causing the change in CPU usage but
why upgrade to 0.19.10? It's now a really old version.

0.20.6 is the recommended stable version, and we're about to (next few
weeks) release 0.90 stable, which is based on Lucene 4.

Lots has changed since 0.19.10 - I don't think you're going to find
anybody who will spend lots of time debugging those changes.

clint


Yeah, we realise it's old, but because of a legacy dependency this app needs
to use Lucene 2.4 (how's that for legacy), which conflicts with the Lucene
version shipped with ES. This causes incompatible class-loading behaviour
when using the default Node client; I think the Explain option within a
query triggers the dependency on that Lucene.

We're forced to use the TransportClient just to work around that at the
moment. However, since 0.20 even this doesn't work anymore: the class-loading
bug is triggered even with the TransportClient.
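
For context, the shape of the workaround on 0.19.x is just a plain
TransportClient with explain left off. A rough sketch follows (cluster name,
hosts and index are placeholders, against the 0.19/0.20-era Java API):

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

// Rough sketch: connect via TransportClient (no embedded node) and search
// with explain disabled, since the Explain path is what we suspect pulls in
// the conflicting Lucene classes. Cluster name, hosts and index are placeholders.
public class TransportClientWorkaround {
    public static void main(String[] args) {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "our-cluster")
                .build();

        TransportClient client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es1", 9300))
                .addTransportAddress(new InetSocketTransportAddress("es2", 9300));

        SearchResponse response = client.prepareSearch("myindex")
                .setQuery(QueryBuilders.matchAllQuery())
                .setExplain(false)   // keep explain off
                .execute().actionGet();

        System.out.println("hits: " + response.getHits().getTotalHits());

        client.close();
    }
}
```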

We'll have to come up with some extra tricks to work around this and allow
co-location of the legacy platform and ES. It's an incremental migration from
the legacy system to ES, done that way just to reduce risk.

If someone can help stop the TransportClient from triggering that behaviour,
we'll upgrade pretty quickly! :)

cheers,

Paul (&Owen)


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.