Heavy load on one node (1 index)

We have a 4-node cluster with 1 index of roughly 25m docs. We have many processes reading from and writing to the cluster through pyes, which we've verified chooses a random initial node to talk to.

The strange thing is that 3 of the 4 nodes are bored (around 0.25 load), while 1 node is taking the brunt: a load of 40-50 on a 16-core box.

We can't figure out why - any pointers? Is there anything we can do to debug what is going on?

Can you gist a stack trace from the busy node? You can use jstack to do so (see the JDK's "jstack - Stack Trace" docs).
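Something along these lines works for grabbing a few dumps to gist - just a rough sketch, assuming the JDK's jps/jstack tools are on the PATH and it runs as the same user as the Elasticsearch process:

    # Rough sketch: capture a few thread dumps from the busy node to gist.
    # Assumes jps/jstack are on PATH and this runs as the Elasticsearch user.
    import subprocess
    import time

    def elasticsearch_pid():
        # jps -l prints "<pid> <main class>" for each local JVM
        for line in subprocess.check_output(["jps", "-l"]).decode().splitlines():
            pid, _, main_class = line.partition(" ")
            if "elasticsearch" in main_class.lower():
                return pid
        raise RuntimeError("no Elasticsearch JVM found")

    pid = elasticsearch_pid()
    for i in range(3):  # a few dumps, a few seconds apart, to catch the hot threads
        with open("jstack-%s-%d.txt" % (pid, i), "w") as out:
            out.write(subprocess.check_output(["jstack", pid]).decode())
        time.sleep(5)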

Is it continuous, or just for a period? If the latter, it was probably doing a merge of big segments, so the load is actually IO, not CPU.

I've been seeing the same thing on one of my indexes. It starts to do a
single merge of GBs of data, and the machine is too overloaded to
respond to requests.

You can check if there is a merge happening by looking at the index
status:
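For example, a quick check could look like this - a rough sketch only, assuming ES answers on localhost:9200, the index is called "places", and the response follows the 0.19-era _status layout:

    # Rough sketch: poll the index status API and report in-flight merges per shard.
    # Assumes ES on localhost:9200 and an index named "places".
    import json
    import urllib.request

    status = json.load(urllib.request.urlopen("http://localhost:9200/places/_status"))
    for shard_id, copies in sorted(status["indices"]["places"]["shards"].items()):
        for copy in copies:
            running = copy["merges"]["current"]
            if running:
                print("shard %s on node %s: %d merge(s), %s in flight"
                      % (shard_id, copy["routing"]["node"], running,
                         copy["merges"]["current_size"]))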

clint

John,

Can you check if your shards are well balanced?
You may be able to tell what's going on using ES head and SPM for ES:

http://mobz.github.com/elasticsearch-head/
http://apps.sematext.com/

The former you run on your own servers, the latter is a service. Both are
free.
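If you prefer the command line to a UI, a rough sketch like the following (assuming ES on the default localhost:9200) prints the shard count per node for each index:

    # Rough sketch: count shard copies per node for every index,
    # straight from the cluster state API.
    import json
    import urllib.request
    from collections import Counter

    state = json.load(urllib.request.urlopen("http://localhost:9200/_cluster/state"))
    node_names = {nid: meta.get("name", nid) for nid, meta in state["nodes"].items()}
    for index, table in state["routing_table"]["indices"].items():
        per_node = Counter()
        for copies in table["shards"].values():
            for shard in copies:
                if shard["node"]:  # skip unassigned copies
                    per_node[node_names[shard["node"]]] += 1
        print(index, dict(per_node))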

Otis

Hiring Elasticsearch Engineers World-Wide -- Jobs - Sematext

Thanks for all the pointers - I'll start diving into all these and let you
know what
I come up with.

Elasticsearch head gave me some interesting results. It turns out we have 3 indexes on our 4-node cluster.

The first index is just a test, so ignore it.

The second is 25m docs (19G) and has 3 shards on 1 node and 1 on each of the others.

The third is 16m docs (30G) and has 5 shards on 1 node, 4 on another, 2 on another, and 1 on the last.

The one with 5 shards on a single node is the one I was having the problem with. It seems like I need to get the shards to rebalance somehow to take advantage of all the hardware I have...

At this point the load is even across the machines, so I haven't used jstack, but I will next time it spikes.

Hmm, so I've been looking around and it seems that rebalancing replica shards isn't something that can be done? In our case the primaries are balanced, but the replicas are piled up on 1 machine of the 4-machine cluster.

Is there something I need to do, or can do, to make them rebalance?

And here's my shard list:

{
  "index": { "primary_size": "30.5gb", "primary_size_in_bytes": 32762267538,
             "size": "62.5gb", "size_in_bytes": 67185556471 },
  "translog": { "operations": 28487 },
  "docs": { "num_docs": 16378524, "max_doc": 24589536, "deleted_docs": 8211012 },
  "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
              "total": 943760, "total_time": "1.5d", "total_time_in_millis": 129794974,
              "total_docs": 2083671491, "total_size": "2704.7gb", "total_size_in_bytes": 2904189840468 },
  "refresh": { "total": 8353127, "total_time": "1.9d", "total_time_in_millis": 171464206 },
  "flush": { "total": 131156, "total_time": "5.5h", "total_time_in_millis": 19850157 },
  "shards": {
    "0": [
      { "routing": { "state": "STARTED", "primary": false, "node": "nELxlGs6Tzii12JRCNxpGw",
                     "relocating_node": null, "shard": 0, "index": "places" },
        "state": "STARTED",
        "index": { "size": "4.9gb", "size_in_bytes": 5352243736 },
        "translog": { "id": 1326141078235, "operations": 3211 },
        "docs": { "num_docs": 2647990, "max_doc": 4003535, "deleted_docs": 1355545 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 98373, "total_time": "4.6h", "total_time_in_millis": 16647902,
                    "total_docs": 213859847, "total_size": "276.4gb", "total_size_in_bytes": 296867190719 },
        "refresh": { "total": 871137, "total_time": "5.4h", "total_time_in_millis": 19523847 },
        "flush": { "total": 13594, "total_time": "49.3m", "total_time_in_millis": 2963485 } },
      { "routing": { "state": "STARTED", "primary": true, "node": "ab4uo_iyTw2WIT2-eb_F6Q",
                     "relocating_node": null, "shard": 0, "index": "places" },
        "state": "STARTED",
        "index": { "size": "4.9gb", "size_in_bytes": 5319904685 },
        "translog": { "id": 1326141078245, "operations": 2483 },
        "docs": { "num_docs": 2647990, "max_doc": 3981357, "deleted_docs": 1333367 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 37155, "total_time": "57.1m", "total_time_in_millis": 3426101,
                    "total_docs": 78881622, "total_size": "101.8gb", "total_size_in_bytes": 109329911898 },
        "refresh": { "total": 328715, "total_time": "1.5h", "total_time_in_millis": 5755051 },
        "flush": { "total": 4796, "total_time": "5.7m", "total_time_in_millis": 344400 } }
    ],
    "1": [
      { "routing": { "state": "STARTED", "primary": false, "node": "ab4uo_iyTw2WIT2-eb_F6Q",
                     "relocating_node": null, "shard": 1, "index": "places" },
        "state": "STARTED",
        "index": { "size": "5.8gb", "size_in_bytes": 6262820543 },
        "translog": { "id": 1326141078607, "operations": 56 },
        "docs": { "num_docs": 2803977, "max_doc": 4754923, "deleted_docs": 1950946 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 37336, "total_time": "57.3m", "total_time_in_millis": 3443330,
                    "total_docs": 79711670, "total_size": "103.8gb", "total_size_in_bytes": 111496656320 },
        "refresh": { "total": 330123, "total_time": "1.6h", "total_time_in_millis": 5859695 },
        "flush": { "total": 4921, "total_time": "6m", "total_time_in_millis": 360131 } },
      { "routing": { "state": "STARTED", "primary": true, "node": "nELxlGs6Tzii12JRCNxpGw",
                     "relocating_node": null, "shard": 1, "index": "places" },
        "state": "STARTED",
        "index": { "size": "5.8gb", "size_in_bytes": 6296790810 },
        "translog": { "id": 1326141078607, "operations": 51 },
        "docs": { "num_docs": 2803978, "max_doc": 4786774, "deleted_docs": 1982796 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 99070, "total_time": "4.8h", "total_time_in_millis": 17343659,
                    "total_docs": 218051849, "total_size": "283gb", "total_size_in_bytes": 303888376066 },
        "refresh": { "total": 876293, "total_time": "5.8h", "total_time_in_millis": 20913727 },
        "flush": { "total": 13835, "total_time": "53.2m", "total_time_in_millis": 3192033 } }
    ],
    "2": [
      { "routing": { "state": "STARTED", "primary": true, "node": "ab4uo_iyTw2WIT2-eb_F6Q",
                     "relocating_node": null, "shard": 2, "index": "places" },
        "state": "STARTED",
        "index": { "size": "5.2gb", "size_in_bytes": 5589513799 },
        "translog": { "id": 1326141079557, "operations": 2653 },
        "docs": { "num_docs": 2737540, "max_doc": 4189158, "deleted_docs": 1451618 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 37295, "total_time": "1h", "total_time_in_millis": 3602012,
                    "total_docs": 86544448, "total_size": "111.6gb", "total_size_in_bytes": 119899765102 },
        "refresh": { "total": 328507, "total_time": "1.6h", "total_time_in_millis": 6103834 },
        "flush": { "total": 5502, "total_time": "6.5m", "total_time_in_millis": 395750 } },
      { "routing": { "state": "STARTED", "primary": false, "node": "nELxlGs6Tzii12JRCNxpGw",
                     "relocating_node": null, "shard": 2, "index": "places" },
        "state": "STARTED",
        "index": { "size": "5.4gb", "size_in_bytes": 5820645020 },
        "translog": { "id": 1326141079550, "operations": 4779 },
        "docs": { "num_docs": 2737540, "max_doc": 4368769, "deleted_docs": 1631229 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 98407, "total_time": "4.9h", "total_time_in_millis": 17647596,
                    "total_docs": 222704925, "total_size": "289.4gb", "total_size_in_bytes": 310757299702 },
        "refresh": { "total": 870872, "total_time": "5.6h", "total_time_in_millis": 20188623 },
        "flush": { "total": 14195, "total_time": "52.5m", "total_time_in_millis": 3154127 } }
    ],
    "3": [
      { "routing": { "state": "STARTED", "primary": true, "node": "T5_oKWa3RTO7osracRtPoQ",
                     "relocating_node": null, "shard": 3, "index": "places" },
        "state": "STARTED",
        "index": { "size": "4.9gb", "size_in_bytes": 5343294260 },
        "translog": { "id": 1326141078245, "operations": 2124 },
        "docs": { "num_docs": 2648125, "max_doc": 3993220, "deleted_docs": 1345095 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 100192, "total_time": "2.7h", "total_time_in_millis": 9740764,
                    "total_docs": 216533766, "total_size": "280.8gb", "total_size_in_bytes": 301514882857 },
        "refresh": { "total": 888382, "total_time": "4.3h", "total_time_in_millis": 15669718 },
        "flush": { "total": 13576, "total_time": "15.2m", "total_time_in_millis": 917058 } },
      { "routing": { "state": "STARTED", "primary": false, "node": "ab4uo_iyTw2WIT2-eb_F6Q",
                     "relocating_node": null, "shard": 3, "index": "places" },
        "state": "STARTED",
        "index": { "size": "4.9gb", "size_in_bytes": 5349875537 },
        "translog": { "id": 1326141078241, "operations": 3886 },
        "docs": { "num_docs": 2648125, "max_doc": 3994418, "deleted_docs": 1346293 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 37161, "total_time": "56.5m", "total_time_in_millis": 3393844,
                    "total_docs": 78031855, "total_size": "101.5gb", "total_size_in_bytes": 108996560347 },
        "refresh": { "total": 328585, "total_time": "1.6h", "total_time_in_millis": 5779601 },
        "flush": { "total": 4780, "total_time": "5.8m", "total_time_in_millis": 351341 } }
    ],
    "4": [
      { "routing": { "state": "STARTED", "primary": true, "node": "T5_oKWa3RTO7osracRtPoQ",
                     "relocating_node": null, "shard": 4, "index": "places" },
        "state": "STARTED",
        "index": { "size": "5.3gb", "size_in_bytes": 5757997414 },
        "translog": { "id": 1326141078676, "operations": 2550 },
        "docs": { "num_docs": 2804287, "max_doc": 4323833, "deleted_docs": 1519546 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 100919, "total_time": "2.7h", "total_time_in_millis": 9958928,
                    "total_docs": 221717044, "total_size": "288.5gb", "total_size_in_bytes": 309792436202 },
        "refresh": { "total": 893623, "total_time": "4.3h", "total_time_in_millis": 15799989 },
        "flush": { "total": 13890, "total_time": "15.4m", "total_time_in_millis": 927816 } },
      { "routing": { "state": "STARTED", "primary": false, "node": "nELxlGs6Tzii12JRCNxpGw",
                     "relocating_node": null, "shard": 4, "index": "places" },
        "state": "STARTED",
        "index": { "size": "5.3gb", "size_in_bytes": 5746398501 },
        "translog": { "id": 1326141078670, "operations": 852 },
        "docs": { "num_docs": 2804287, "max_doc": 4316365, "deleted_docs": 1512078 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 99169, "total_time": "4.7h", "total_time_in_millis": 17225939,
                    "total_docs": 222259294, "total_size": "288.5gb", "total_size_in_bytes": 309856763814 },
        "refresh": { "total": 877186, "total_time": "5.5h", "total_time_in_millis": 19947688 },
        "flush": { "total": 13880, "total_time": "51.7m", "total_time_in_millis": 3106099 } }
    ],
    "5": [
      { "routing": { "state": "STARTED", "primary": false, "node": "nELxlGs6Tzii12JRCNxpGw",
                     "relocating_node": null, "shard": 5, "index": "places" },
        "state": "STARTED",
        "index": { "size": "5.4gb", "size_in_bytes": 5891305596 },
        "translog": { "id": 1326141079451, "operations": 388 },
        "docs": { "num_docs": 2736604, "max_doc": 4468212, "deleted_docs": 1731608 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 98430, "total_time": "4.8h", "total_time_in_millis": 17328442,
                    "total_docs": 218687019, "total_size": "284.5gb", "total_size_in_bytes": 305584707796 },
        "refresh": { "total": 871118, "total_time": "5.5h", "total_time_in_millis": 20154030 },
        "flush": { "total": 14086, "total_time": "54m", "total_time_in_millis": 3242524 } },
      { "routing": { "state": "STARTED", "primary": true, "node": "XGfCby6mT0Ox2zESwvBPsw",
                     "relocating_node": null, "shard": 5, "index": "places" },
        "state": "STARTED",
        "index": { "size": "4.1gb", "size_in_bytes": 4454766570 },
        "translog": { "id": 1326141079457, "operations": 5454 },
        "docs": { "num_docs": 2736604, "max_doc": 3315194, "deleted_docs": 578590 },
        "merges": { "current": 0, "current_docs": 0, "current_size": "0b", "current_size_in_bytes": 0,
                    "total": 100253, "total_time": "2.7h", "total_time_in_millis": 10036457,
                    "total_docs": 226688152, "total_size": "294.4gb", "total_size_in_bytes": 316205289645 },
        "refresh": { "total": 888586, "total_time": "4.3h", "total_time_in_millis": 15768403 },
        "flush": { "total": 14101, "total_time": "14.9m", "total_time_in_millis": 895393 } }
    ]
  }
}

Ha, I figured out what was going on. I was totally ignoring a test index that had been allocated onto two of the nodes, so the 3 indexes' shards were in fact balanced across ALL the nodes. My assumption was that each index would be balanced shard-by-shard across all the machines, not that ALL shards are balanced across the nodes regardless of which index they belong to.

I'd love to have that as an option (balance each index's shards across the nodes), because right now my two indexes end up split pretty much as 1 index on 2 machines and 1 index on the other 2. Since one of my indexes is a legacy index (we use it about 1/100th as much as the other one), I'm still not utilizing my machines.

Now I'll just need to figure out how to trick it :)

And just to finish up - I ended up setting my legacy index's replica count to (number of nodes - 1), so every node holds a copy - it's read-only, so it's just disk cost - and that is forcing my new index to balance evenly across my machines, like I want.

Sweet!
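For anyone who hits the same thing, the change boils down to an update-settings call like this - a sketch only; the index name "legacy" is a placeholder, and 3 replicas corresponds to our 4-node cluster:

    # Rough sketch of the workaround described above: give the read-only legacy
    # index one replica per remaining node (replicas = nodes - 1) via the index
    # update-settings API, so every node carries a full copy and the allocator
    # spreads the busy index evenly. Index name "legacy" is a placeholder.
    import json
    import urllib.request

    body = json.dumps({"index": {"number_of_replicas": 3}}).encode()  # 4 nodes -> 3 replicas
    req = urllib.request.Request("http://localhost:9200/legacy/_settings",
                                 data=body, method="PUT")
    print(urllib.request.urlopen(req).read().decode())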

I wonder if you are coming across this issue:
http://elasticsearch-users.115913.n3.nabble.com/Shard-Balancing-td3676576.html

Yep, that looks like it. I like the pointer to the ability to create your own allocator - I'll definitely look into that.
