Elasticsearch on ZFS best practice

Patrick_Proniewski · May 13, 2014, 5:39am

Hello,

I'm running an Elasticsearch node on a FreeBSD server, on top of ZFS storage. For now I've assumed that ES is smart and manages its own cache, so I've disabled the primary cache for data, leaving only metadata cacheable. The last thing I want is to have data cached twice, once in the ZFS ARC and a second time in the application's own cache. I've also disabled compression:

$ zfs get compression,primarycache,recordsize zdata/elasticsearch
NAME                 PROPERTY      VALUE     SOURCE
zdata/elasticsearch  compression   off       local
zdata/elasticsearch  primarycache  metadata  local
zdata/elasticsearch  recordsize    128K      default

It's a general-purpose server (web, mysql, mail, ELK, etc.). I'm not looking for the absolute best ES performance, I'm looking for the best use of my resources.
I have 16 GB RAM, and I plan to cap the ARC size (currently consuming 8.2 GB RAM) so I can mlockall ES memory. But I don't think I'll go the RAM-only storage route (http://jprante.github.io/applications/2012/07/26/Mmap-with-Lucene.html) as I'm running only one node.
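
For reference, the two settings mentioned here (an ARC cap and mlockall) might look roughly like the following. This is only a sketch, assuming FreeBSD's boot-time tunable vfs.zfs.arc_max and the ES 1.x names ES_HEAP_SIZE and bootstrap.mlockall. The helper name print_memory_settings and the 4G/4g values are placeholders made up for illustration, not recommendations from this thread.

```shell
#!/bin/sh
# Sketch: print the settings for a given ARC cap and ES heap size.
# print_memory_settings is a hypothetical helper; it only prints text and
# changes nothing on the system.
print_memory_settings() {
    arc_max="$1"   # cap for the ZFS ARC, e.g. 4G
    es_heap="$2"   # JVM heap for Elasticsearch, e.g. 4g
    echo "# /boot/loader.conf (FreeBSD boot-time tunable)"
    echo "vfs.zfs.arc_max=\"$arc_max\""
    echo "# environment of the ES process (ES 1.x)"
    echo "ES_HEAP_SIZE=$es_heap"
    echo "# elasticsearch.yml (ES 1.x setting name)"
    echo "bootstrap.mlockall: true"
}

# Placeholder sizes, not a recommendation.
print_memory_settings 4G 4g
```

The sizes have to be chosen so that the ARC cap, the locked ES heap and the other services on the box all fit in RAM together.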

How can I estimate the amount of memory I must allocate to the ES process?

Should I switch primarycache=all back on despite ES already caching data?

What is the best ZFS record/block size to accommodate Elasticsearch/Lucene IOs?

Thanks,
Patrick

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/FBBA84AE-D610-4060-AFBC-FC7D5BA0803F%40patpro.net.
For more options, visit https://groups.google.com/d/optout.

Patrick_Proniewski · May 21, 2014, 9:37am

No one?

jprante · May 21, 2014, 11:49am

There is not much to add:

Estimating ES process memory really depends on individual requirements (bulk indexing, field cache, filter/facet, concurrent queries). Just take a portion of your data, measure memory/CPU/disk I/O, and extrapolate. It is best to add nodes if resources get tight. The rule of thumb is 50% of RAM for the ES heap.

You are correct, primarycache=all may buffer more data than required (it is useful for maximum ZFS performance). You have already limited the ARC size. Use mmapfs for the ES store; this should work best with primarycache=metadata.

ZFS recordsize for JVM apps like ES should be 4k (note that the ZFS default is 128K). Also, with ES it is important to match the ZFS recordsize with the kernel page size and the sector size of the drive, so there is no skew in the number of I/O operations. Check for yourself whether higher values like 8k / 16k / 64k / 256k get better throughput on the ES data folder. On certain striped HW RAID devices that may be the case, but I doubt it (ZFS internal buffering compensates for this effect, and write throughput will suffer if recordsize is too high).

And you should switch off atime on the ES data folder.
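
The dataset-level suggestions above could be applied roughly as follows. This is a minimal sketch, assuming the dataset name zdata/elasticsearch from the original post. The function name tune_es_dataset and the RUN variable are made up for illustration. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: print (or run) the ZFS property changes suggested in this thread.
# tune_es_dataset and RUN are hypothetical names, not part of ZFS.
# RUN defaults to "echo" (dry run); call with RUN= (empty) to really apply.
tune_es_dataset() {
    ds="$1"
    run="${RUN-echo}"
    $run zfs set atime=off "$ds"
    $run zfs set recordsize=4k "$ds"
    $run zfs set primarycache=metadata "$ds"
}

# Dry run: prints the three zfs commands for the dataset.
tune_es_dataset zdata/elasticsearch
```

Keep in mind that recordsize only applies to blocks written after the change, so existing index files keep their old block size until they are rewritten.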

Jörg

Patrick_Proniewski · May 21, 2014, 12:24pm

Hi Jörg,

On 21 May 2014, at 13:49, joergprante@gmail.com wrote:

> Estimating ES process memory really depends on individual requirements
> (bulk indexing, field cache, filter/facet, concurrent queries). Just take
> a portion of your data, measure memory/CPU/disk I/O, and extrapolate. It is
> best to add nodes if resources get tight. The rule of thumb is 50% of RAM
> for the ES heap.

I'm not sure I understand what you mean by "just take a portion of your data". Am I supposed to run a query in Kibana that returns a known amount of data, measure mem/cpu/io during the request, then extrapolate to get the amount of those resources needed to return all my data?

> You are correct, primarycache=all may buffer more data than required
> (it is useful for maximum ZFS performance). You have already limited the
> ARC size. Use mmapfs for the ES store; this should work best with
> primarycache=metadata.

Ok, I was mistaken about mmapfs, I've read some documentation and now it looks a bit clearer to me.

> ZFS recordsize for JVM apps like ES should be 4k (note that the ZFS default
> is 128K). Also, with ES it is important to match the ZFS recordsize with the
> kernel page size and the sector size of the drive, so there is no skew in
> the number of I/O operations. Check for yourself whether higher values like
> 8k / 16k / 64k / 256k get better throughput on the ES data folder. On
> certain striped HW RAID devices that may be the case, but I doubt it (ZFS
> internal buffering compensates for this effect, and write throughput will
> suffer if recordsize is too high).

My FS is (should be?) properly aligned on the physical 4K-block HDD, so it should be quite efficient to move to a ZFS volume with a 4k block size if that's best for ES.
I'll take some I/O measurements to make sure performance does not degrade.
Every page size is 4k (FreeBSD 9.x):

$ sysctl -a | egrep page_?size:
vm.stats.vm.v_page_size: 4096
hw.pagesize: 4096
p1003_1b.pagesize: 4096

> And you should switch off atime on the ES data folder.

I can do that too.

Thank you for your reply.

Patrick_Proniewski · May 21, 2014, 9:49pm

After changing recordsize to 4k and moving my data away and back so that it is rewritten with the new block size, I see an impressive difference in I/O and bandwidth usage. I tested the same Kibana request (get everything, with filter program:apache, for the last 30 days):

before (recordsize=128k -> variable)

        capacity     operations     bandwidth
pool   alloc   free   read  write   read  write
zdata   661G  1.17T      3     41  65.2K   267K
zdata   661G  1.17T    554     41  47.7M   351K
zdata   661G  1.17T    424     24  43.5M   725K
zdata   661G  1.17T    465     54  50.4M   838K
zdata   661G  1.17T      2     36  54.8K   179K

after (recordsize=4k -> fixed)

        capacity     operations     bandwidth
pool   alloc   free   read  write   read  write
zdata   661G  1.17T      1     16  6.80K  72.4K
zdata   661G  1.17T  1.46K     15  5.88M  64.4K
zdata   661G  1.17T      3     42  12.4K   243K

Display in Kibana does not feel faster, so I guess I have another bottleneck somewhere (network maybe, it's a remote server, over DSL). On the disk bandwidth side, this is clearly a huge win.
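
One way to read these samples is to divide read bandwidth by read operations, which gives the average size of a single read. Below is a minimal sketch with awk. The helper name avg_read_kib is made up, and it assumes the K/M suffixes in the output are 1024-based (so 1.46K operations is taken as about 1495).

```shell
#!/bin/sh
# Sketch: average size of one read, in KiB, from a zpool iostat sample.
# avg_read_kib is a hypothetical helper. Arguments: read bandwidth in MiB
# and number of read operations for the same sampling interval.
avg_read_kib() {
    awk -v mib="$1" -v ops="$2" 'BEGIN { printf "%.1f\n", mib * 1024 / ops }'
}

avg_read_kib 47.7 554     # before: 47.7M over 554 reads   -> 88.2
avg_read_kib 5.88 1495    # after:  5.88M over 1.46K reads -> 4.0
```

Under these assumptions the busiest "before" sample averages roughly 88 KiB per read, while the "after" sample averages roughly 4 KiB per read, which matches the new recordsize.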

Patrick

bash99 · January 11, 2017, 6:48am

What I want to say is that disabling or playing around with primarycache in ZFS is really a bad idea.
Setting primarycache=metadata is bad practice for Elasticsearch.

We took a heavy performance hit from setting primarycache=metadata on the advice of this thread, and even with an L2ARC (secondarycache=all), performance was just as bad.
It took a few days to figure out and fix the problem, as we are new to both systems.

First, according to https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html,
mmapfs maps to Lucene's MMapDirectory, and mmap depends on the OS file cache/VM for fast reads. See also:
http://elasticsearch-users.115913.n3.nabble.com/How-to-run-elastic-from-memory-td4031249.html

Second, ZFS only fills the L2ARC with entries evicted from the ARC, so setting secondarycache=all is useless when primarycache=metadata is set.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209396
http://weblog.etherized.com/posts/185.html

So we got a very low hit rate on the ARC and L2ARC, and random read I/O caused delays on our system.
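
A hit rate like the one described above can be derived from the cumulative ARC hit and miss counters. This is a minimal sketch. The helper name arc_hit_rate and the sample numbers are made up, and where the counters come from depends on the platform, which this post does not state.

```shell
#!/bin/sh
# Sketch: ARC hit rate as a percentage from two cumulative counters.
# arc_hit_rate is a hypothetical helper. The counters would come from the
# OS, for example on FreeBSD:
#   sysctl -n kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
# or on ZFS on Linux from /proc/spl/kstat/zfs/arcstats.
arc_hit_rate() {
    awk -v hits="$1" -v misses="$2" 'BEGIN {
        total = hits + misses
        if (total == 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * hits / total
    }'
}

# Made-up counter values, for illustration only.
arc_hit_rate 900 100
```

If the rate stays low with primarycache=metadata, `zfs set primarycache=all <dataset>` restores data caching in the ARC.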

Patrick_Proniewski · January 11, 2017, 8:15am

Hello,

Well, you might want to ignore this thread. Seriously, it's from 2014, based on a FreeBSD release that is no longer supported and on an ELK version that probably has little in common with the current one.

And more importantly, don't take answers for granted without considering the context first. I was not looking for state-of-the-art performance; the context was:

jprante · January 11, 2017, 9:27am

Thanks for your insight about primarycache=metadata, which is considered a suitable setting for database workloads with high write throughput. For Elasticsearch, however, the query workload will typically outweigh other workload patterns. So if the ZFS ARC can be adapted to coexist with the mmap()-cached files of the most recent Lucene 6, it is reasonable to use primarycache=all.

Another ZFS-specific issue on Linux is that Linux reads 128k at each random read from a file system, while ZFS block access will be 4k for each call, which will lead to a 32x higher IO random read rate when caching is effectively disabled by primarycache=metadata, resulting in poor performance.

ZFS has quite a learning curve and it took me months to configure ZFS for an Oracle DB on Solaris (which I migrated finally to XFS under Linux).
