Thanks again for the help. I left it alone last night and came back this
morning and everything was green!
Start time was 2012-09-13 22:09:24,340 and the last shard recovery finished
at 2012-09-14 01:23:04,679, so about 3 hours. That falls in line with what
you said, so I think everything is fine; I just have a lot of data! The
logs I showed before were from when both nodes were shut down and I restarted a
single node. I can see why that was faster, since it didn't have to verify
anything.
Thanks again,
Matt
[2012-09-13 22:09:24,340][INFO ][node] [node1] {0.19.9}[5404]: started
[2012-09-13 22:09:27,973][TRACE][indices.recovery] [node1] [XXX][2] starting recovery from [node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}
[2012-09-13 22:09:28,060][TRACE][indices.recovery] [node1] [XXX][4] starting recovery from [node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}
[2012-09-13 23:25:32,834][DEBUG][indices.recovery] [node1] [XXX][2] recovery completed from [node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}, took[1.2h]
   phase1: recovered_files [185] with total_size of [21.2gb], took [1.2h], throttling_wait [0s]
         : reusing_files [32] with total_size of [15gb]
   phase2: start took [6.4s]
         : recovered [0] transaction log operations, took [31ms]
   phase3: recovered [0] transaction log operations, took [3ms]
[2012-09-13 23:25:32,932][TRACE][indices.recovery] [node1] [XXX][3] starting recovery from [node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}
[2012-09-13 23:54:16,998][DEBUG][indices.recovery] [node1] [XXX][4] recovery completed from [node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}, took[1.7h]
   phase1: recovered_files [192] with total_size of [21.2gb], took [1.7h], throttling_wait [0s]
         : reusing_files [25] with total_size of [14.3gb]
   phase2: start took [3.4s]
         : recovered [0] transaction log operations, took [0s]
   phase3: recovered [0] transaction log operations, took [1ms]
[2012-09-13 23:54:17,100][TRACE][indices.recovery] [node1] [XXX][0] starting recovery from [node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}
[2012-09-14 00:26:01,777][DEBUG][indices.recovery] [node1] [XXX][3] recovery completed from [node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}, took[1h]
   phase1: recovered_files [177] with total_size of [21.2gb], took [1h], throttling_wait [0s]
         : reusing_files [24] with total_size of [14.6gb]
   phase2: start took [3.8s]
         : recovered [0] transaction log operations, took [0s]
   phase3: recovered [0] transaction log operations, took [2ms]
[2012-09-14 00:39:12,757][DEBUG][indices.recovery] [node1] [XXX][0] recovery completed from [node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}, took[44.9m]
   phase1: recovered_files [193] with total_size of [21.2gb], took [44.8m], throttling_wait [0s]
         : reusing_files [24] with total_size of [14.6gb]
   phase2: start took [4.3s]
         : recovered [0] transaction log operations, took [0s]
   phase3: recovered [0] transaction log operations, took [3ms]
[2012-09-14 01:23:04,679][DEBUG][indices.recovery] [node1] [XXX][1] recovery completed from [node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}, took[57m]
   phase1: recovered_files [216] with total_size of [21.2gb], took [56.9m], throttling_wait [0s]
         : reusing_files [25] with total_size of [14.4gb]
   phase2: start took [3.5s]
         : recovered [0] transaction log operations, took [0s]
   phase3: recovered [0] transaction log operations, took [5ms]
On Friday, September 14, 2012 12:25:59 AM UTC-4, tallpsmith wrote:
Also, the TRACE-level logging for the recovery (see the 'indices.recovery'
logger in the logging.yml config; set it to trace) will give you more detail
on the decision points in the recovery timings (I see DEBUG level
below, so presumably you've tweaked it, though maybe only to DEBUG level?).
Paul
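For reference, a minimal sketch of that change against the default 0.19.x config/logging.yml (the logger: section already exists in the shipped file):

# config/logging.yml -- raise the recovery logger to TRACE
logger:
  # existing entries (e.g. action: DEBUG) stay as they are
  indices.recovery: TRACE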
On 14 September 2012 12:01, mcot <mbc...@gmail.com> wrote:
The master seems to recover very quickly. This is one shard's recovery:
[2012-09-13 21:59:16,989][TRACE][index.gateway.local ] [node0] [0] using existing shard data, translog id [1347044409196]
[2012-09-13 21:59:18,937][DEBUG][index.gateway ] [node0] [0] recovery completed from local, took [1.9s]
    index    : files [219] with total_size [21.2gb], took[1ms]
             : recovered_files [0] with total_size [0b]
             : reusing_files [219] with total_size [21.2gb]
    start    : took [1.9s], check_index [0s]
    translog : number_of_operations [0], took [1ms]
On Thursday, September 13, 2012 9:47:50 PM UTC-4, mcot wrote:
Thanks a lot for the reply. This was just my initial testing, and I
think I will soon have two nodes available, each with 24GB of RAM and faster
disks. I have reduced the min/max heap to 4GB to see if that was the issue
with recovery. I just restarted the replica node and it appears to be
doing the same thing: it is only recovering two shards while the other three
shards remain unassigned. Is this the default behavior? The node that is
recovering has very low CPU, memory, and disk usage. Java heap usage
appears to be ~150MB, while top shows 0.7% CPU usage for the process and
iotop shows ~1% disk usage.
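For what it's worth, two shards recovering at once matches Elasticsearch's default per-node concurrent recovery limit of 2. If that turns out to be the limiter, a sketch of raising it in elasticsearch.yml (the value 4 is just an example):

# elasticsearch.yml -- sketch: let more shards recover onto a node concurrently
# (the default of 2 matches seeing only two shards initializing at a time)
cluster.routing.allocation.node_concurrent_recoveries: 4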
On Thursday, September 13, 2012 8:57:38 PM UTC-4, mcot wrote:
Hi, I am a new Elasticsearch user and have spent about two weeks
learning the software. I have built a cluster with two nodes and allocated
each node about 8GB of RAM, with the Java heap min/max set to 7GB on each
node. The nodes sit on the same rack, right next to each other
(gigabit switch connection).
My use case requires a fairly massive number of documents. I have
defined a single index with two mapping types and have bulk indexed
364,864,102 documents that are a mix of the two types. The index currently
takes 106.2gb on disk. I do not store any fields, and I do not even
store the _source JSON; in my use case, the ID of each matching document
is all I need to get from queries.
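A minimal sketch of what that kind of mapping looks like (the index, type, and field names here are placeholders; _source is disabled explicitly and fields are left unstored, which is the default):

curl -XPUT 'http://localhost:9200/myindex/mytype/_mapping' -d '{
  "mytype": {
    "_source": { "enabled": false },
    "properties": {
      "body": { "type": "string", "store": "no" }
    }
  }
}'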
I used the default settings of 5 shards and 1 replica. By default,
Elasticsearch put all of the primary shards on one node and all of the
replicas on the second node.
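For anyone reproducing this, those counts come from the defaults; they can also be set explicitly when the index is created (the shard count cannot be changed afterwards). A sketch with a placeholder index name:

curl -XPUT 'http://localhost:9200/myindex' -d '{
  "settings": {
    "index": { "number_of_shards": 5, "number_of_replicas": 1 }
  }
}'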
My questions are about performance given the above facts. I see that
cold queries sometimes take 100+ seconds. Do I simply have too much data
for two nodes given the heap size on each node? Did I make a mistake by
only using 5 shards? What is the advantage of having more shards if they
are on the same host? Hot queries that have been cached seem to be fairly
fast, which is nice, but I see that heap usage stays pretty low until some
queries are performed. Is there any way to warm the caches when the
node is starting up?
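As far as I know there is no built-in warm-up in 0.19 (index warmers arrived in later releases), so the usual workaround is simply to fire a few representative queries right after the node starts; the index and field names below are placeholders:

# sketch: run a representative query after startup to pull data into the caches
curl -XGET 'http://localhost:9200/myindex/_search' -d '{
  "size": 0,
  "query": { "term": { "body": "example" } }
}'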
Another question is about node startup. If I shut down the
replica node and then restart it, it takes a VERY long time (on
the order of hours) to initialize each shard and bring the cluster fully
back to life (green). Again, I have no idea if I simply have too much data,
not enough shards, or something else.
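While a node is recovering, the easiest way to watch progress is the cluster health API, which reports the initializing and unassigned shard counts (hostname is a placeholder):

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'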
Basically, I just want to get a sense of how many documents are "too
many" for a single node with a given amount of RAM.
Thanks in advance.