ES issues and debugging help

We've been using ES for the last two months, slowly getting ready to
put it into production.

We have roughly 24M documents on 3 machines, each with 24G of RAM, 24
cores, and 4 SSDs in RAID 0. Index size is roughly 36G in total. The
JVM is given up to 18G per machine.

We have it set up as two indices: one using 6 shards and 1 replica
with 16M documents, and one using 12 shards and 1 replica with about
350,000 documents. (We are basically rebuilding the first index into
the second.) Benchmarking, we can get about 100 q/s per machine when
the cluster is doing nothing else, which we like. I chose that many
shards because I expect the number of items to eventually reach about
150-500M.
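
For reference, the indices were created with explicit shard/replica
counts. A minimal sketch of that kind of index creation call, using
just the Python standard library rather than PyES (the host and the
exact settings script are illustrative assumptions, not what we
actually run):

import json
import urllib.request

def create_index(host, name, shards, replicas):
    # PUT /<index> with static settings; shard count can't be changed later.
    body = json.dumps({
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://%s/%s" % (host, name),
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

create_index("localhost:9200", "legacy_places", 6, 1)
create_index("localhost:9200", "places", 12, 1)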

We are having the following issues and I'm looking for some tips on
debugging:

First, the documents aren't evenly distributed: I have 24M, 16M, and
8M on different machines. If I bring it down to two machines, it evens
out at 16M/16M - is that odd, or expected? I thought it would always
be equal across the 3.

Every day or so, one of the machines spikes quickly (in a matter of
minutes), runs out of memory, and basically becomes unresponsive. We
then see the other two spike as well; sometimes they recover, and
often they don't. This only happens while I'm running queries against
the cluster - if it isn't being queried and I take a node down, the
shards rebalance in a matter of minutes and all is happy. We access
everything through PyES.

I'm really looking for ways to debug what's going on - the logs aren't
that useful; I see the out-of-memory error on one node, and then the
loss of connection to it from the others. I can post some logs if that
would be useful. We did see one stack trace (once) that may have shown
the failure happening in a delete, but that may have just been random.
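
One thing we can do in the meantime is watch heap usage over time. A
rough sketch of a monitoring loop against the nodes stats API - the
path shown is the 0.x-era one (newer releases expose the same data
under /_nodes/stats), and the exact field names may vary by version,
so treat this as an assumption rather than our actual tooling:

import json
import time
import urllib.request

def heap_used(host="localhost:9200"):
    # Depending on version you may need to request the jvm section explicitly.
    with urllib.request.urlopen("http://%s/_cluster/nodes/stats" % host) as resp:
        stats = json.load(resp)
    report = {}
    for node_id, node in stats["nodes"].items():
        # Field names here follow the 0.x response shape.
        report[node.get("name", node_id)] = node["jvm"]["mem"]["heap_used"]
    return report

while True:
    print(heap_used())
    time.sleep(60)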

And while I'm at it, does anyone (Shay? :) do ES consulting, to
potentially work with us to figure out what's going on?

Here's the index status for the two indices, the larger one first:

{
"ok" : true,
"_shards" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"indices" : {
"legacy_places" : {
"index" : {
"primary_size" : "19gb",
"primary_size_in_bytes" : 20476411666,
"size" : "38.1gb",
"size_in_bytes" : 40952823332
},
"translog" : {
"operations" : 0
},
"docs" : {
"num_docs" : 24370334,
"max_doc" : 24385103,
"deleted_docs" : 14769
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 187,
"total_time" : "138ms",
"total_time_in_millis" : 138
},
"shards" : {
"0" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 0,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.1gb",
"size_in_bytes" : 6656257213
},
"translog" : {
"id" : 1323890905623,
"operations" : 0
},
"docs" : {
"num_docs" : 7917088,
"max_doc" : 7921473,
"deleted_docs" : 4385
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 32,
"total_time" : "25ms",
"total_time_in_millis" : 25
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 0,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.1gb",
"size_in_bytes" : 6656257213
},
"translog" : {
"id" : 1323890905623,
"operations" : 0
},
"docs" : {
"num_docs" : 7917088,
"max_doc" : 7921473,
"deleted_docs" : 4385
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "24ms",
"total_time_in_millis" : 24
}
} ],
"1" : [ {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 1,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.4gb",
"size_in_bytes" : 6954949268
},
"translog" : {
"id" : 1323890905917,
"operations" : 0
},
"docs" : {
"num_docs" : 8257213,
"max_doc" : 8262908,
"deleted_docs" : 5695
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "24ms",
"total_time_in_millis" : 24
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 1,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.4gb",
"size_in_bytes" : 6954949268
},
"translog" : {
"id" : 1323890905917,
"operations" : 0
},
"docs" : {
"num_docs" : 8257213,
"max_doc" : 8262908,
"deleted_docs" : 5695
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "24ms",
"total_time_in_millis" : 24
}
} ],
"2" : [ {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 2,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.3gb",
"size_in_bytes" : 6865205185
},
"translog" : {
"id" : 1323890905928,
"operations" : 0
},
"docs" : {
"num_docs" : 8196033,
"max_doc" : 8200722,
"deleted_docs" : 4689
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "22ms",
"total_time_in_millis" : 22
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 2,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.3gb",
"size_in_bytes" : 6865205185
},
"translog" : {
"id" : 1323890905928,
"operations" : 0
},
"docs" : {
"num_docs" : 8196033,
"max_doc" : 8200722,
"deleted_docs" : 4689
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "19ms",
"total_time_in_millis" : 19
}
} ]
}
}
}
}

And the smaller one:

{
"ok" : true,
"_shards" : {
"total" : 12,
"successful" : 12,
"failed" : 0
},
"indices" : {
"places" : {
"index" : {
"primary_size" : "287.3mb",
"primary_size_in_bytes" : 301304028,
"size" : "574.6mb",
"size_in_bytes" : 602608056
},
"translog" : {
"operations" : 0
},
"docs" : {
"num_docs" : 328917,
"max_doc" : 370550,
"deleted_docs" : 41633
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 372,
"total_time" : "140ms",
"total_time_in_millis" : 140
},
"shards" : {
"0" : [ {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 0,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "44.9mb",
"size_in_bytes" : 47116017
},
"translog" : {
"id" : 1324591339645,
"operations" : 0
},
"docs" : {
"num_docs" : 52001,
"max_doc" : 58573,
"deleted_docs" : 6572
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 0,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "44.9mb",
"size_in_bytes" : 47116017
},
"translog" : {
"id" : 1324591339645,
"operations" : 0
},
"docs" : {
"num_docs" : 52001,
"max_doc" : 58573,
"deleted_docs" : 6572
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "12ms",
"total_time_in_millis" : 12
}
} ],
"1" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 1,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "43.5mb",
"size_in_bytes" : 45616358
},
"translog" : {
"id" : 1324591339818,
"operations" : 0
},
"docs" : {
"num_docs" : 50991,
"max_doc" : 55917,
"deleted_docs" : 4926
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "10ms",
"total_time_in_millis" : 10
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 1,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "43.5mb",
"size_in_bytes" : 45616358
},
"translog" : {
"id" : 1324591339818,
"operations" : 0
},
"docs" : {
"num_docs" : 50991,
"max_doc" : 55917,
"deleted_docs" : 4926
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
} ],
"2" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 2,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "54.7mb",
"size_in_bytes" : 57375684
},
"translog" : {
"id" : 1324591339466,
"operations" : 0
},
"docs" : {
"num_docs" : 60920,
"max_doc" : 70373,
"deleted_docs" : 9453
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 2,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "54.7mb",
"size_in_bytes" : 57375684
},
"translog" : {
"id" : 1324591339466,
"operations" : 0
},
"docs" : {
"num_docs" : 60920,
"max_doc" : 70373,
"deleted_docs" : 9453
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "15ms",
"total_time_in_millis" : 15
}
} ],
"3" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 3,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "45.4mb",
"size_in_bytes" : 47665161
},
"translog" : {
"id" : 1324591339663,
"operations" : 0
},
"docs" : {
"num_docs" : 52301,
"max_doc" : 58879,
"deleted_docs" : 6578
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 3,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "45.4mb",
"size_in_bytes" : 47665161
},
"translog" : {
"id" : 1324591339663,
"operations" : 0
},
"docs" : {
"num_docs" : 52301,
"max_doc" : 58879,
"deleted_docs" : 6578
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "14ms",
"total_time_in_millis" : 14
}
} ],
"4" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 4,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "44.1mb",
"size_in_bytes" : 46278154
},
"translog" : {
"id" : 1324591339839,
"operations" : 0
},
"docs" : {
"num_docs" : 51625,
"max_doc" : 56555,
"deleted_docs" : 4930
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "11ms",
"total_time_in_millis" : 11
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 4,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "44.1mb",
"size_in_bytes" : 46278154
},
"translog" : {
"id" : 1324591339839,
"operations" : 0
},
"docs" : {
"num_docs" : 51625,
"max_doc" : 56555,
"deleted_docs" : 4930
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
} ],
"5" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 5,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "54.6mb",
"size_in_bytes" : 57252654
},
"translog" : {
"id" : 1324591339578,
"operations" : 0
},
"docs" : {
"num_docs" : 61079,
"max_doc" : 70253,
"deleted_docs" : 9174
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "6ms",
"total_time_in_millis" : 6
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 5,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "54.6mb",
"size_in_bytes" : 57252654
},
"translog" : {
"id" : 1324591339578,
"operations" : 0
},
"docs" : {
"num_docs" : 61079,
"max_doc" : 70253,
"deleted_docs" : 9174
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "7ms",
"total_time_in_millis" : 7
}
} ]
}
}
}
}

John Cwikla
VPE
fwix

Hey:

Regarding evenly distributing "docs": elasticsearch aims to put an
even number of shards on each node, regardless of their "weight", so
you can end up in this situation. We should support weighted
allocation to make this better; it's not there yet.

Regarding the failure, do you have the logged OutOfMemory failure? Can
you gist it? It usually happens when a search suddenly requires much
more memory - for example, when you suddenly sort on a new field, that
field's values are loaded into memory, which can cause this.
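
For illustration, a request of the kind described - sorting on a field
for the first time - can be as simple as the following sketch (the
field name and index are placeholders, not taken from this cluster):

import json
import urllib.request

# The first sort on "some_field" loads its values into memory on every
# node holding a shard of the index.
query = {
    "query": {"match_all": {}},
    "sort": [{"some_field": {"order": "desc"}}],
    "size": 10,
}
req = urllib.request.Request(
    "http://localhost:9200/legacy_places/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(json.load(urllib.request.urlopen(req)))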

-shay.banon

On Thu, Dec 29, 2011 at 11:44 PM, John Cwikla cwikla@fwix.com wrote:


The odd part about the balancing is that I can watch the nodes catch
up - they'll each have 16M - and then a few minutes later it ends up
at 8/16/24.

We are looking into getting you a gist; I want to reproduce it first,
and I'm trying to figure out the best way to reproduce our case at
this point. We currently only have about 5-10 processes hitting the
servers (balanced across the 3), and the memory looks like it just
keeps creeping up and up. Once the nodes get to about 90% of allocated
memory, if one of them spikes and runs out of memory, all hell breaks
loose: nodes stop talking to each other, and I need to restart them.

The other interesting part is that once I've restarted the nodes, with
the scripts still running (they just log errors and move on), the
memory starts at the minimum and starts crawling up again. It's like
nothing is ever being freed.

On Dec 30, 4:40 am, Shay Banon kim...@gmail.com wrote:


When the servers start fresh and you start to run your "scripts", how
much memory do they consume? Basically, when Java hits an OOM it dumps
the heap to a file, and elasticsearch is configured to do that:
https://github.com/elasticsearch/elasticsearch/blob/master/bin/elasticsearch.in.sh#L44

The problem is that the heap dump file is as large as the heap you
allocated, which is 18gb. If the servers, once your search requests
are running, sit at, let's say, 4gb, then maybe you can allocate 8gb
to the ES (java) process, let it OOM, and gzip->dropbox the file so I
can download it and see where the memory is being used?
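
A minimal sketch of the gzip step, assuming the .hprof lands in the
data directory (the actual path depends on how -XX:HeapDumpPath is set
in elasticsearch.in.sh, so the glob below is an assumption):

import glob
import gzip
import shutil

# Find the most recent heap dump and compress it for upload.
dumps = sorted(glob.glob("/var/lib/elasticsearch/*.hprof"))
if dumps:
    latest = dumps[-1]
    with open(latest, "rb") as src, gzip.open(latest + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    print("compressed", latest + ".gz")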

Another option is to go into a bit more detail about what you run
against elasticsearch - what type of search requests, for example.

On Fri, Dec 30, 2011 at 9:35 PM, John Cwikla cwikla@fwix.com wrote:


Well, I was trying, but the heap dumps are pretty big. I've narrowed
it down - we are using a custom scorer that balloons. Walking over an
index with 12,000,000 docs, I'm finding that every half million or so
the memory jumps up 2-4G (not gradually - in a single 2-4G step), so
after a million or so my machines die an untimely death.

I'll keep trying to reproduce a case, but in the meantime we've
decided to modify our code to do a second scoring phase on our side,
which should solve our problem, since without our scorer nothing bad
happens :)
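
A rough sketch of what that second scoring phase looks like on our
side: fetch the top candidates from ES with plain scoring, then
rescore them in application code. rescore() and the index name here
are placeholders for our real logic, not actual production code:

import json
import urllib.request

def search(host, index, query, size=200):
    # Plain query, no custom scorer - just pull back candidates.
    req = urllib.request.Request(
        "http://%s/%s/_search" % (host, index),
        data=json.dumps({"query": query, "size": size}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return json.load(urllib.request.urlopen(req))["hits"]["hits"]

def rescore(hit):
    # Placeholder for the scoring logic we moved out of the ES scorer.
    return hit["_score"]

hits = search("localhost:9200", "places", {"match_all": {}})
hits.sort(key=rescore, reverse=True)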

On Dec 30 2011, 12:45 pm, Shay Banon kim...@gmail.com wrote:
