In my ES cluster I index data on a daily basis, and I plan to add SSD disks to the ES nodes because the spinning disks are struggling with the I/O load (about 3 TB/day).
I want the indexing to be done on the SSDs, and after an index "closes" and no more writes go to it, I want to move it to HDD.
I know it's possible to add multiple data paths in path.data, but what I don't know is: can this be done online? After I add the data path, can I just move the index to the second path and ES will automatically detect it? Should I close the index before the move? Disable flush? What is the correct procedure for something like this?
Yeah, don't do it that way. Multiple data paths are kind of like "index structure aware RAID", but they don't work very well in 1.x because files aren't always scattered across the paths in ways that make recovery work well. Personally I prefer to just use software RAID over multiple data paths, though there are other Elastic employees who disagree there.
As Mike says, the usual way to have two different types of storage is to use two different sets of nodes and use shard allocation filtering to keep the indexes you are writing to on your "hot" nodes and move the other indexes to "cold" nodes.
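Roughly like this, as a minimal sketch against 1.x/2.x syntax (the attribute name `disk_type` is arbitrary; in 5.x and later, custom attributes live under `node.attr.*` instead):

```yaml
# elasticsearch.yml on the SSD-backed ("hot") nodes
node.disk_type: ssd

# elasticsearch.yml on the HDD-backed ("cold") nodes
node.disk_type: hdd
```

Then pin the index you are actively writing to the hot nodes (index name here is just an example):

```
curl -XPUT 'localhost:9200/logs-2016.02.01/_settings' -d '{
  "index.routing.allocation.include.disk_type": "ssd"
}'
```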
One crucial thing here: you can run two Elasticsearch nodes on the same physical machine; they just run as separate processes. The deb and rpm packages aren't set up for it, but you can do the file copies manually if you are comfortable with that kind of thing. It's probably simpler to just get three whole new machines with two SSDs in RAID 0, call them the "hot" nodes, and not try to share the same machines: the processes won't have to share a page cache, and it lets you bring even more RAM to bear on the problem, which is almost always a good thing.
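If you do go the shared-machine route, the second instance mostly just needs its own node name, data path, and ports. A rough sketch (all paths, names, and port numbers here are made up, adjust to your layout):

```yaml
# second instance's elasticsearch.yml
cluster.name: my-cluster            # must match the existing nodes
node.name: node1-hot
node.disk_type: ssd                 # the allocation-filtering attribute
path.data: /mnt/ssd/elasticsearch   # points at the SSDs
http.port: 9201                     # keep clear of the first node's 9200
transport.tcp.port: 9301            # likewise for transport (default 9300)
```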
When you do upgrade to 2.x, watch out for synchronous commits being on by default. You'll notice a performance hit, significant if you are using small bulk sizes. It's ~7% for large-ish bulk sizes, which was deemed worth the safety it provides.
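If you measure that hit and decide you'd rather have the old behavior back, you can switch the translog back to async fsync per index, trading a small durability window for throughput (index name is a placeholder):

```
curl -XPUT 'localhost:9200/my-index/_settings' -d '{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}'
```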
So just to make sure I understand: I have to add additional nodes that use the SSD disks (on the same server or a dedicated one), give them some kind of arbitrary tag like "node.disk_type: ssd", set the working index to use them with "index.routing.allocation.include.disk_type": "ssd", and when I'm done with the "hot" index, re-allocate it to the HDD nodes with "index.routing.allocation.include.disk_type": "hdd"?
Will it re-allocate the index online? Can it still be accessed for queries? Should I flush/close/open the index before the re-allocation?
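In other words, once a day's index is done, I'd run something like this (index name just an example)?

```
# flip the filter and let ES relocate the shards to the HDD nodes
curl -XPUT 'localhost:9200/logs-2016.02.01/_settings' -d '{
  "index.routing.allocation.include.disk_type": "hdd"
}'
```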