New User -- Index Settings Reccomdendations and Suggestions


(mcot) #1

Hi, I am a new user to Elastic Search and have spent about two weeks
learning the software. I have built a cluster with two nodes and allocated
each node about 8GB of RAM. I have set the Java Heap min/max to 7GB for
each of the nodes. The nodes sit on the same rack right next to each other
(gigabit switch connection).

My use case requires a fairly massive number of documents. I have defined
a single index with two mapping types and I have bulk indexed 364, 864,
102 documents which are a mix of the two types. The index currently takes 106.2gb
on disk. I do not store any fields and I do not even store the _source
json. In my use case the id of the document that matches is all that I
need to get from queries.

I used the default settings of 5 shards and 1 replica. By default, Elastic
Search put all of the primary shards on one node and all of the replicas on
the second node.

My questions are about performance given the above facts. I see that cold
queries sometimes take 100+ seconds. Do I simply have too much data for
two nodes given the heap size for each node? Did I make a mistake by only
using 5 shards? What is the advantage of having more shards if they are on
the same host? Hot queries which have been cached seem to be fairly fast
which is nice but I see that the heap usage is pretty low until some
queries are performed. Is there any way to do an initial cache when the
node is starting up.

Another question would be about node startup. If I shutdown the replica
node and then restart it, I see that it takes a VERY long time (on the
order of hours) to initialize each shard and bring it fully back to life
(green). Again, I have no idea if I simply have too much data or not
enough shards or something else.

Basically I just want to get a sense of how many documents is "too much"
for a single node and a given ammount of RAM.

Thanks in advance.

--


(Paul Smith) #2

On 14 September 2012 10:57, mcot mbc8434@gmail.com wrote:

Hi, I am a new user to Elastic Search and have spent about two weeks
learning the software. I have built a cluster with two nodes and allocated
each node about 8GB of RAM. I have set the Java Heap min/max to 7GB for
each of the nodes. The nodes sit on the same rack right next to each other
(gigabit switch connection).

If the physical RAM is 8Gb, and you've given ES a 7Gb Heap, you're going to
swap. A 7Gb heap will end up with, roughly, a 10Gb process running on the
OS (there's more to the memory allocated by Java than just the heap). You
then leave (less than) zero RAM available for the OS to cache disk pages
where the index lives on. You probably should start with 2x the physical
RAM for the heap, so 16Gb RAM for the box, and an 7GB java heap should
leave a good chunk to cache disk pages. Indexes absolutely love RAM.

My use case requires a fairly massive number of documents. I have defined
a single index with two mapping types and I have bulk indexed 364, 864,
102 documents which are a mix of the two types. The index currently takes 106.2gb
on disk. I do not store any fields and I do not even store the _source
json. In my use case the id of the document that matches is all that I
need to get from queries.

that's a lot for just 2 nodes I have to say. You should use the _segments
API to inspect how big the shards are, if any shard has segments >10Gb,
which I'm betting they would be if the total size is 106Gb over 5 shards,
I'd say you'll get better performance by spreading over more shards and
more nodes. If you just increase the shard count though, with 2 nodes
you're probably not going to see a lot of improvement, unless you can take
advantage of the Routing feature for both indexing and searching so your
searches can target specific shards and minimise cross-shard search
aggregations, but this may not be practical for your use case. (check the
ES docs about routing, it's a good trick if your search pattern can work
that way).

I used the default settings of 5 shards and 1 replica. By default,
Elastic Search put all of the primary shards on one node and all of the
replicas on the second node.

My questions are about performance given the above facts. I see that cold
queries sometimes take 100+ seconds. Do I simply have too much data for
two nodes given the heap size for each node? Did I make a mistake by only
using 5 shards? What is the advantage of having more shards if they are on
the same host? Hot queries which have been cached seem to be fairly fast
which is nice but I see that the heap usage is pretty low until some
queries are performed. Is there any way to do an initial cache when the
node is starting up.

ES 0.20 will have a shard warmer API to allow this, it should cover both a
node startup, and during indexing to warm up newer segments. 100 seconds
for the first query is long, what sort of query's are you doing? The first
query will trigger segment loading which brings in ports of the index
segment into RAM, so your disk performance could be a factor scanning the
large segments. How is the disk configured? As said before, you've got no
RAM at all spare for disk caching, so you could well be swapping.

Another question would be about node startup. If I shutdown the replica
node and then restart it, I see that it takes a VERY long time (on the
order of hours) to initialize each shard and bring it fully back to life
(green). Again, I have no idea if I simply have too much data or not
enough shards or something else.

Bringing the node back up will cause recovery, I'm presuming you're running
the default Local gateway, so the node that is starting will recover
locally, and scan the local disk to try and match up which segments of each
shard it has locally that match the primary from the existing node, so disk
IO performance is critical here, and also, again, you probably have no RAM
available for disk cache, so I'm betting it's even worse.

Basically I just want to get a sense of how many documents is "too much"
for a single node and a given ammount of RAM.

It's a pretty big index, I think it's too big for 2 nodes unless you can
throw an awful lot more RAM at each box. If you have an index of this size,
you should be investing in getting good performance metric collection and
analytics for the Hardware, OS, ES Java Process, and the ES metrics
themselves, Sematext's SPM for example, and using BigDesk. Feel free to
post screenshots or more details from elasticsearch-head and/or bigdesk so
we can comment further, but basically I think you probably going to need
more nodes.

regards,

Paul Smith

--


(mcot) #3

Thanks a lot for the reply. This was just my initial testing and I think I
will soon have available two nodes each with 24GB of RAM and faster disks.
I have reduced the min/max heap to 4GB to see if that was the issue with
recovery. I just restarted the replica node and it appears to be doing
the same thing. It is only recovering two shards while the other 3 shards
remain unassigned. Is this the default case? The node that is recovering
has very low cpu, memory and disk usage. Java heap usage appears to be
~150MB, while top shows 0.7% CPU usage for the process and iotop shows ~1%
disk usage.

On Thursday, September 13, 2012 8:57:38 PM UTC-4, mcot wrote:

Hi, I am a new user to Elastic Search and have spent about two weeks
learning the software. I have built a cluster with two nodes and allocated
each node about 8GB of RAM. I have set the Java Heap min/max to 7GB for
each of the nodes. The nodes sit on the same rack right next to each other
(gigabit switch connection).

My use case requires a fairly massive number of documents. I have defined
a single index with two mapping types and I have bulk indexed 364, 864,
102 documents which are a mix of the two types. The index currently takes 106.2gb
on disk. I do not store any fields and I do not even store the _source
json. In my use case the id of the document that matches is all that I
need to get from queries.

I used the default settings of 5 shards and 1 replica. By default,
Elastic Search put all of the primary shards on one node and all of the
replicas on the second node.

My questions are about performance given the above facts. I see that cold
queries sometimes take 100+ seconds. Do I simply have too much data for
two nodes given the heap size for each node? Did I make a mistake by only
using 5 shards? What is the advantage of having more shards if they are on
the same host? Hot queries which have been cached seem to be fairly fast
which is nice but I see that the heap usage is pretty low until some
queries are performed. Is there any way to do an initial cache when the
node is starting up.

Another question would be about node startup. If I shutdown the replica
node and then restart it, I see that it takes a VERY long time (on the
order of hours) to initialize each shard and bring it fully back to life
(green). Again, I have no idea if I simply have too much data or not
enough shards or something else.

Basically I just want to get a sense of how many documents is "too much"
for a single node and a given ammount of RAM.

Thanks in advance.

--


(mcot) #4

The master seems to recover very very quickly:

This is one shard recovery:

[2012-09-13 21:59:16,989][TRACE][index.gateway.local ] [node0] [] [0]
using existing shard data, translog id [1347044409196]
[2012-09-13 21:59:18,937][DEBUG][index.gateway ] [node0] [][0]
recovery completed from local, took [1.9s]
index : files [219] with total_size [21.2gb], took[1ms]
: recovered_files [0] with total_size [0b]
: reusing_files [219] with total_size [21.2gb]
start : took [1.9s], check_index [0s]
translog : number_of_operations [0], took [1ms]

On Thursday, September 13, 2012 9:47:50 PM UTC-4, mcot wrote:

Thanks a lot for the reply. This was just my initial testing and I think
I will soon have available two nodes each with 24GB of RAM and faster
disks. I have reduced the min/max heap to 4GB to see if that was the issue
with recovery. I just restarted the replica node and it appears to be
doing the same thing. It is only recovering two shards while the other 3
shards remain unassigned. Is this the default case? The node that is
recovering has very low cpu, memory and disk usage. Java heap usage
appears to be ~150MB, while top shows 0.7% CPU usage for the process and
iotop shows ~1% disk usage.

On Thursday, September 13, 2012 8:57:38 PM UTC-4, mcot wrote:

Hi, I am a new user to Elastic Search and have spent about two weeks
learning the software. I have built a cluster with two nodes and allocated
each node about 8GB of RAM. I have set the Java Heap min/max to 7GB for
each of the nodes. The nodes sit on the same rack right next to each other
(gigabit switch connection).

My use case requires a fairly massive number of documents. I have
defined a single index with two mapping types and I have bulk indexed 364,
864, 102 documents which are a mix of the two types. The index currently
takes 106.2gb on disk. I do not store any fields and I do not even
store the _source json. In my use case the id of the document that matches
is all that I need to get from queries.

I used the default settings of 5 shards and 1 replica. By default,
Elastic Search put all of the primary shards on one node and all of the
replicas on the second node.

My questions are about performance given the above facts. I see that cold
queries sometimes take 100+ seconds. Do I simply have too much data for
two nodes given the heap size for each node? Did I make a mistake by only
using 5 shards? What is the advantage of having more shards if they are on
the same host? Hot queries which have been cached seem to be fairly fast
which is nice but I see that the heap usage is pretty low until some
queries are performed. Is there any way to do an initial cache when the
node is starting up.

Another question would be about node startup. If I shutdown the replica
node and then restart it, I see that it takes a VERY long time (on the
order of hours) to initialize each shard and bring it fully back to life
(green). Again, I have no idea if I simply have too much data or not
enough shards or something else.

Basically I just want to get a sense of how many documents is "too much"
for a single node and a given ammount of RAM.

Thanks in advance.

--


(Paul Smith) #5

On 14 September 2012 11:47, mcot mbc8434@gmail.com wrote:

Thanks a lot for the reply. This was just my initial testing and I think
I will soon have available two nodes each with 24GB of RAM and faster
disks. I have reduced the min/max heap to 4GB to see if that was the issue
with recovery. I just restarted the replica node and it appears to be
doing the same thing. It is only recovering two shards while the other 3
shards remain unassigned. Is this the default case? The node that is
recovering has very low cpu, memory and disk usage. Java heap usage
appears to be ~150MB, while top shows 0.7% CPU usage for the process and
iotop shows ~1% disk usage.

the default is 2 shards-at-a-time-recovery, you can tune that, but only if
you think your cpu/disk/network traffic can cope (it can easily overwhelm a
running cluster if you let it go wild). your shards a big though, so each
recovering shard is going to take a while to verify against the current
primary shard in the other cluster. see
http://www.elasticsearch.org/guide/reference/modules/cluster.html

the other bit that affects the time to recovery is listed here:
http://www.elasticsearch.org/guide/reference/modules/gateway/local.html if
you can give ES the hint that there's only 2 nodes, you can set the recover
after nodes to 2 say, knowing you have everyone and can start straight
away. that won't speed up the 'hours' though, I can't see recovery taking
hours unless one or both of the nodes is doing some serious
cpu/disk/network traffic. Have you looked at the network traffic stats?

if there were no mutations on the existing primary node while the replica
node you shutdown, then it should be quick to recover because the local
shard should be practically identical to the primary shard, so not much to
send back from the primary shard to the newly started replica.

The RAM will help you with the queries, less so with the recovery, unless
you were swapping (and you must have been close to swapping with the 7Gb
heap allocated).

It's different for everyone because different query types require different
memory requirements, but we have a ~70million item, which is about ~70Gb in
size, across 2 nodes, and only allocate 2Gb RAM (yes two) to each ES
process, and the local box has 32Gb each (each box is doing other things
though) and recovery only takes a few minutes, and searches warm up very
quickly and are on avg about 11ms.

Paul

--


(Paul Smith) #6

On 14 September 2012 12:01, mcot mbc8434@gmail.com wrote:

The master seems to recover very very quickly:

This is one shard recovery:

[2012-09-13 21:59:16,989][TRACE][index.gateway.local ] [node0] [] [0]
using existing shard data, translog id [1347044409196]
[2012-09-13 21:59:18,937][DEBUG][index.gateway ] [node0] [][0]
recovery completed from local, took [1.9s]
index : files [219] with total_size [21.2gb], took[1ms]
: recovered_files [0] with total_size [0b]
: reusing_files [219] with total_size [21.2gb]
start : took [1.9s], check_index [0s]
translog : number_of_operations [0], took [1ms]

when you say master, is this the recovery when you are going from a cold,
cluster-completely-shutdown, to a single node startup (and hence, this 1st
node becomes the master) ? Since there's no other node to discuss whether
it has the correct state or not, it has to assume what it has is what it
is, so no need to really 'recover' anything other than taking the local
disk gateway and comparing with the running index state directory (which
should be identical in a normal case).

When the 2nd node starts up, it won't be the master, it'll detect the other
node. This 2nd node (the replica we'll call it) will have some local
state, potentially state, but must assume the other node (the master) has
better information about particular shard state, so has to treat the master
as the single point of truth. Any differences between the master's shard
and this replica shard then need to be reconciled.

I'm not exactly sure how the comparison is done on a segment level, but the
replica will have to ensure it has the correct shard copy transferred
locally, it might choose to work out which segments are the same, and ship
only the differences, I would have to check in the code how it does it,
unless Shay or someone else familiar can save me the trouble to explain the
algorithm quickly...?

Paul

--


(Paul Smith) #7

also the TRACE level logging for the recover (see 'indices.recovery' logger
in logging.yml config, set it to trace) will give you more info on details
of the decision points on recovery timings (I see DEBUG level below, so
presumably you've tweaked it, maybe only to DEBUG level though?

Paul

On 14 September 2012 12:01, mcot mbc8434@gmail.com wrote:

The master seems to recover very very quickly:

This is one shard recovery:

[2012-09-13 21:59:16,989][TRACE][index.gateway.local ] [node0] [] [0]
using existing shard data, translog id [1347044409196]
[2012-09-13 21:59:18,937][DEBUG][index.gateway ] [node0] [][0]
recovery completed from local, took [1.9s]
index : files [219] with total_size [21.2gb], took[1ms]
: recovered_files [0] with total_size [0b]
: reusing_files [219] with total_size [21.2gb]
start : took [1.9s], check_index [0s]
translog : number_of_operations [0], took [1ms]

On Thursday, September 13, 2012 9:47:50 PM UTC-4, mcot wrote:

Thanks a lot for the reply. This was just my initial testing and I think
I will soon have available two nodes each with 24GB of RAM and faster
disks. I have reduced the min/max heap to 4GB to see if that was the issue
with recovery. I just restarted the replica node and it appears to be
doing the same thing. It is only recovering two shards while the other 3
shards remain unassigned. Is this the default case? The node that is
recovering has very low cpu, memory and disk usage. Java heap usage
appears to be ~150MB, while top shows 0.7% CPU usage for the process and
iotop shows ~1% disk usage.

On Thursday, September 13, 2012 8:57:38 PM UTC-4, mcot wrote:

Hi, I am a new user to Elastic Search and have spent about two weeks
learning the software. I have built a cluster with two nodes and allocated
each node about 8GB of RAM. I have set the Java Heap min/max to 7GB for
each of the nodes. The nodes sit on the same rack right next to each other
(gigabit switch connection).

My use case requires a fairly massive number of documents. I have
defined a single index with two mapping types and I have bulk indexed 364,
864, 102 documents which are a mix of the two types. The index currently
takes 106.2gb on disk. I do not store any fields and I do not even
store the _source json. In my use case the id of the document that matches
is all that I need to get from queries.

I used the default settings of 5 shards and 1 replica. By default,
Elastic Search put all of the primary shards on one node and all of the
replicas on the second node.

My questions are about performance given the above facts. I see that
cold queries sometimes take 100+ seconds. Do I simply have too much data
for two nodes given the heap size for each node? Did I make a mistake by
only using 5 shards? What is the advantage of having more shards if they
are on the same host? Hot queries which have been cached seem to be fairly
fast which is nice but I see that the heap usage is pretty low until some
queries are performed. Is there any way to do an initial cache when the
node is starting up.

Another question would be about node startup. If I shutdown the replica
node and then restart it, I see that it takes a VERY long time (on the
order of hours) to initialize each shard and bring it fully back to life
(green). Again, I have no idea if I simply have too much data or not
enough shards or something else.

Basically I just want to get a sense of how many documents is "too much"
for a single node and a given ammount of RAM.

Thanks in advance.

--

--


(mcot) #8

Thanks again for the help. I left it alone last night and came back this
morning and everything was green!

Start time was 2012-09-13 22:09:24,340 and last shard recovery was
at 2012-09-14 01:23:04,679. Around ~3 hours. That falls in line with what
you said so I think everything is fine, I just have a lot of data! The
logs I showed before was when both nodes where shutdown and I restarted a
single node. I can obviously see why that was faster since it didn't have
to verify anything.

Thanks again,
Matt

[2012-09-13 22:09:24,340][INFO ][node] [node1] {0.19.9}[5404]: started

[2012-09-13 22:09:27,973][TRACE][indices.recovery] [node1] [XXX][2]
starting recovery from
[node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}
[2012-09-13 22:09:28,060][TRACE][indices.recovery] [node1] [XXX][4]
starting recovery from
[node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}

[2012-09-13 23:25:32,834][DEBUG][indices.recovery] [node1] [XXX][2]
recovery completed from
[node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true},
took[1.2h]

phase1: recovered_files [185] with total_size of [21.2gb], took [1.2h],
throttling_wait [0s]
: reusing_files [32] with total_size of [15gb]
phase2: start took [6.4s]
: recovered [0] transaction log operations, took [31ms]
phase3: recovered [0] transaction log operations, took [3ms]

[2012-09-13 23:25:32,932][TRACE][indices.recovery] [node1] [XXX][3]
starting recovery from
[node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}

[2012-09-13 23:54:16,998][DEBUG][indices.recovery] [node1] [XXX][4]
recovery completed from
[node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true},
took[1.7h]
phase1: recovered_files [192] with total_size of [21.2gb], took [1.7h],
throttling_wait [0s]
: reusing_files [25] with total_size of [14.3gb]
phase2: start took [3.4s]
: recovered [0] transaction log operations, took [0s]
phase3: recovered [0] transaction log operations, took [1ms]

[2012-09-13 23:54:17,100][TRACE][indices.recovery] [node1] [XXX][0]
starting recovery from
[node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true}

[2012-09-14 00:26:01,777][DEBUG][indices.recovery] [node1] [XXX][3]
recovery completed from
[node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true},
took[1h]
phase1: recovered_files [177] with total_size of [21.2gb], took [1h],
throttling_wait [0s]
: reusing_files [24] with total_size of [14.6gb]
phase2: start took [3.8s]
: recovered [0] transaction log operations, took [0s]
phase3: recovered [0] transaction log operations, took [2ms]

[2012-09-14 00:39:12,757][DEBUG][indices.recovery] [node1] [XXX][0]
recovery completed from
[node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true},
took[44.9m]
phase1: recovered_files [193] with total_size of [21.2gb], took [44.8m],
throttling_wait [0s]
: reusing_files [24] with total_size of [14.6gb]
phase2: start took [4.3s]
: recovered [0] transaction log operations, took [0s]
phase3: recovered [0] transaction log operations, took [3ms]

[2012-09-14 01:23:04,679][DEBUG][indices.recovery] [node1] [XXX][1]
recovery completed from
[node0][tRXvvpH_S0uwGpFXjYiEbA][inet[/192.168.120.57:9300]]{master=true},
took[57m]
phase1: recovered_files [216] with total_size of [21.2gb], took [56.9m],
throttling_wait [0s]
: reusing_files [25] with total_size of [14.4gb]
phase2: start took [3.5s]
: recovered [0] transaction log operations, took [0s]
phase3: recovered [0] transaction log operations, took [5ms]

On Friday, September 14, 2012 12:25:59 AM UTC-4, tallpsmith wrote:

also the TRACE level logging for the recover (see 'indices.recovery'
logger in logging.yml config, set it to trace) will give you more info on
details of the decision points on recovery timings (I see DEBUG level
below, so presumably you've tweaked it, maybe only to DEBUG level though?

Paul

On 14 September 2012 12:01, mcot <mbc...@gmail.com <javascript:>> wrote:

The master seems to recover very very quickly:

This is one shard recovery:

[2012-09-13 21:59:16,989][TRACE][index.gateway.local ] [node0] []
[0] using existing shard data, translog id [1347044409196]
[2012-09-13 21:59:18,937][DEBUG][index.gateway ] [node0] [][0]
recovery completed from local, took [1.9s]
index : files [219] with total_size [21.2gb], took[1ms]
: recovered_files [0] with total_size [0b]
: reusing_files [219] with total_size [21.2gb]
start : took [1.9s], check_index [0s]
translog : number_of_operations [0], took [1ms]

On Thursday, September 13, 2012 9:47:50 PM UTC-4, mcot wrote:

Thanks a lot for the reply. This was just my initial testing and I
think I will soon have available two nodes each with 24GB of RAM and faster
disks. I have reduced the min/max heap to 4GB to see if that was the issue
with recovery. I just restarted the replica node and it appears to be
doing the same thing. It is only recovering two shards while the other 3
shards remain unassigned. Is this the default case? The node that is
recovering has very low cpu, memory and disk usage. Java heap usage
appears to be ~150MB, while top shows 0.7% CPU usage for the process and
iotop shows ~1% disk usage.

On Thursday, September 13, 2012 8:57:38 PM UTC-4, mcot wrote:

Hi, I am a new user to Elastic Search and have spent about two weeks
learning the software. I have built a cluster with two nodes and allocated
each node about 8GB of RAM. I have set the Java Heap min/max to 7GB for
each of the nodes. The nodes sit on the same rack right next to each other
(gigabit switch connection).

My use case requires a fairly massive number of documents. I have
defined a single index with two mapping types and I have bulk indexed 364,
864, 102 documents which are a mix of the two types. The index currently
takes 106.2gb on disk. I do not store any fields and I do not even
store the _source json. In my use case the id of the document that matches
is all that I need to get from queries.

I used the default settings of 5 shards and 1 replica. By default,
Elastic Search put all of the primary shards on one node and all of the
replicas on the second node.

My questions are about performance given the above facts. I see that
cold queries sometimes take 100+ seconds. Do I simply have too much data
for two nodes given the heap size for each node? Did I make a mistake by
only using 5 shards? What is the advantage of having more shards if they
are on the same host? Hot queries which have been cached seem to be fairly
fast which is nice but I see that the heap usage is pretty low until some
queries are performed. Is there any way to do an initial cache when the
node is starting up.

Another question would be about node startup. If I shutdown the
replica node and then restart it, I see that it takes a VERY long time (on
the order of hours) to initialize each shard and bring it fully back to
life (green). Again, I have no idea if I simply have too much data or not
enough shards or something else.

Basically I just want to get a sense of how many documents is "too
much" for a single node and a given ammount of RAM.

Thanks in advance.

--

--


(system) #9