We've been using ES for the last two months, slowly, getting ready to
start using it into production.
We have roughly 24m documents, on 3 machines, 24G, 24 core, 4 disk SSD
RAID 0 in each. Index size is roughly 36G in total. JVM is given up to
18G per machine.
we have it set up into two indexes, one that is using 6 shards and 1
replica with 16m documents, and one that is using 12 shards and 1
replica with about 350,000 documents. (We are basically rebuilding the
first index into the second). Benchmarking we can get about 100 q/s
per machine when it's doing nothing else, so we like that. I have many
shards as the number of items I expect to go into here is about
150-500m.
We are having the following issues and I'm looking for some tips on
debugging:
First, the documents aren't evenly shared, I have 24M, 16M, and 8M on
different machines. If I bring it down to two machines, it's even
16/16 - is that odd, or expected? I thought it would always be equal
across the 3.
Every day or so, one of the machines spikes quickly (in a matter of
minutes), runs out of memory, and basically becomes unresponsive. We
then see the other two basically spike, and sometimes they recover,
and often, they don't. This is while I'm running queries against it -
if they aren't being queried against, and I take a node down, they
rebalance in a matter of minutes and all is happy. We access
everything through PyES.
I'm really looking for ways to debug what's going on - the logs aren't
that useful, I see out of memory on one, and then the loss of
connection to it from the others - I can post some if it would be
useful. We did see one stack trace (once) that may showed the stack
in a delete, but that may have just been random.
And while I'm at it, anyone (Shay? do ES consulting to potentially
work with us to figure out what's going on?
Here's the two shards:
{
"ok" : true,
"_shards" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"indices" : {
"legacy_places" : {
"index" : {
"primary_size" : "19gb",
"primary_size_in_bytes" : 20476411666,
"size" : "38.1gb",
"size_in_bytes" : 40952823332
},
"translog" : {
"operations" : 0
},
"docs" : {
"num_docs" : 24370334,
"max_doc" : 24385103,
"deleted_docs" : 14769
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 187,
"total_time" : "138ms",
"total_time_in_millis" : 138
},
"shards" : {
"0" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 0,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.1gb",
"size_in_bytes" : 6656257213
},
"translog" : {
"id" : 1323890905623,
"operations" : 0
},
"docs" : {
"num_docs" : 7917088,
"max_doc" : 7921473,
"deleted_docs" : 4385
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 32,
"total_time" : "25ms",
"total_time_in_millis" : 25
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 0,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.1gb",
"size_in_bytes" : 6656257213
},
"translog" : {
"id" : 1323890905623,
"operations" : 0
},
"docs" : {
"num_docs" : 7917088,
"max_doc" : 7921473,
"deleted_docs" : 4385
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "24ms",
"total_time_in_millis" : 24
}
} ],
"1" : [ {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 1,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.4gb",
"size_in_bytes" : 6954949268
},
"translog" : {
"id" : 1323890905917,
"operations" : 0
},
"docs" : {
"num_docs" : 8257213,
"max_doc" : 8262908,
"deleted_docs" : 5695
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "24ms",
"total_time_in_millis" : 24
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 1,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.4gb",
"size_in_bytes" : 6954949268
},
"translog" : {
"id" : 1323890905917,
"operations" : 0
},
"docs" : {
"num_docs" : 8257213,
"max_doc" : 8262908,
"deleted_docs" : 5695
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "24ms",
"total_time_in_millis" : 24
}
} ],
"2" : [ {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 2,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.3gb",
"size_in_bytes" : 6865205185
},
"translog" : {
"id" : 1323890905928,
"operations" : 0
},
"docs" : {
"num_docs" : 8196033,
"max_doc" : 8200722,
"deleted_docs" : 4689
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "22ms",
"total_time_in_millis" : 22
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 2,
"index" : "legacy_places"
},
"state" : "STARTED",
"index" : {
"size" : "6.3gb",
"size_in_bytes" : 6865205185
},
"translog" : {
"id" : 1323890905928,
"operations" : 0
},
"docs" : {
"num_docs" : 8196033,
"max_doc" : 8200722,
"deleted_docs" : 4689
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "19ms",
"total_time_in_millis" : 19
}
} ]
}
}
}
}
And the smaller one:
{
"ok" : true,
"_shards" : {
"total" : 12,
"successful" : 12,
"failed" : 0
},
"indices" : {
"places" : {
"index" : {
"primary_size" : "287.3mb",
"primary_size_in_bytes" : 301304028,
"size" : "574.6mb",
"size_in_bytes" : 602608056
},
"translog" : {
"operations" : 0
},
"docs" : {
"num_docs" : 328917,
"max_doc" : 370550,
"deleted_docs" : 41633
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 372,
"total_time" : "140ms",
"total_time_in_millis" : 140
},
"shards" : {
"0" : [ {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 0,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "44.9mb",
"size_in_bytes" : 47116017
},
"translog" : {
"id" : 1324591339645,
"operations" : 0
},
"docs" : {
"num_docs" : 52001,
"max_doc" : 58573,
"deleted_docs" : 6572
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 0,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "44.9mb",
"size_in_bytes" : 47116017
},
"translog" : {
"id" : 1324591339645,
"operations" : 0
},
"docs" : {
"num_docs" : 52001,
"max_doc" : 58573,
"deleted_docs" : 6572
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "12ms",
"total_time_in_millis" : 12
}
} ],
"1" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 1,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "43.5mb",
"size_in_bytes" : 45616358
},
"translog" : {
"id" : 1324591339818,
"operations" : 0
},
"docs" : {
"num_docs" : 50991,
"max_doc" : 55917,
"deleted_docs" : 4926
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "10ms",
"total_time_in_millis" : 10
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 1,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "43.5mb",
"size_in_bytes" : 45616358
},
"translog" : {
"id" : 1324591339818,
"operations" : 0
},
"docs" : {
"num_docs" : 50991,
"max_doc" : 55917,
"deleted_docs" : 4926
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
} ],
"2" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 2,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "54.7mb",
"size_in_bytes" : 57375684
},
"translog" : {
"id" : 1324591339466,
"operations" : 0
},
"docs" : {
"num_docs" : 60920,
"max_doc" : 70373,
"deleted_docs" : 9453
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "VMc1uC1uRUCnxCZTzE4S9A",
"relocating_node" : null,
"shard" : 2,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "54.7mb",
"size_in_bytes" : 57375684
},
"translog" : {
"id" : 1324591339466,
"operations" : 0
},
"docs" : {
"num_docs" : 60920,
"max_doc" : 70373,
"deleted_docs" : 9453
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "15ms",
"total_time_in_millis" : 15
}
} ],
"3" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 3,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "45.4mb",
"size_in_bytes" : 47665161
},
"translog" : {
"id" : 1324591339663,
"operations" : 0
},
"docs" : {
"num_docs" : 52301,
"max_doc" : 58879,
"deleted_docs" : 6578
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 3,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "45.4mb",
"size_in_bytes" : 47665161
},
"translog" : {
"id" : 1324591339663,
"operations" : 0
},
"docs" : {
"num_docs" : 52301,
"max_doc" : 58879,
"deleted_docs" : 6578
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "14ms",
"total_time_in_millis" : 14
}
} ],
"4" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 4,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "44.1mb",
"size_in_bytes" : 46278154
},
"translog" : {
"id" : 1324591339839,
"operations" : 0
},
"docs" : {
"num_docs" : 51625,
"max_doc" : 56555,
"deleted_docs" : 4930
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "11ms",
"total_time_in_millis" : 11
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 4,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "44.1mb",
"size_in_bytes" : 46278154
},
"translog" : {
"id" : 1324591339839,
"operations" : 0
},
"docs" : {
"num_docs" : 51625,
"max_doc" : 56555,
"deleted_docs" : 4930
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "13ms",
"total_time_in_millis" : 13
}
} ],
"5" : [ {
"routing" : {
"state" : "STARTED",
"primary" : true,
"node" : "pNWhe9zmQjCxw2b7SIJTdw",
"relocating_node" : null,
"shard" : 5,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "54.6mb",
"size_in_bytes" : 57252654
},
"translog" : {
"id" : 1324591339578,
"operations" : 0
},
"docs" : {
"num_docs" : 61079,
"max_doc" : 70253,
"deleted_docs" : 9174
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "6ms",
"total_time_in_millis" : 6
}
}, {
"routing" : {
"state" : "STARTED",
"primary" : false,
"node" : "NHx8JgLkSD-IMuff8c349A",
"relocating_node" : null,
"shard" : 5,
"index" : "places"
},
"state" : "STARTED",
"index" : {
"size" : "54.6mb",
"size_in_bytes" : 57252654
},
"translog" : {
"id" : 1324591339578,
"operations" : 0
},
"docs" : {
"num_docs" : 61079,
"max_doc" : 70253,
"deleted_docs" : 9174
},
"merges" : {
"current" : 0,
"current_docs" : 0,
"current_size" : "0b",
"current_size_in_bytes" : 0,
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0,
"total_docs" : 0,
"total_size" : "0b",
"total_size_in_bytes" : 0
},
"refresh" : {
"total" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"flush" : {
"total" : 31,
"total_time" : "7ms",
"total_time_in_millis" : 7
}
} ]
}
}
}
}
John Cwikla
VPE
fwix