Hi,
I have the ELK stack installed via Docker, with in particular 3 Elasticsearch nodes and 1 Logstash node.
My question is whether there is a way to have the data ingested by a specific Logstash pipeline saved in a specific directory on my host machine, avoiding /usr/share/elasticsearch/data/indices, which is bind-mounted to a host machine folder.
In other words, I want all the data of a specific index to be saved on a specific NAS.
I've tried using symbolic links, but without success: I moved the index folder (named with the index UUID) into /nas/node01/, left a symbolic link in its place, and did this for each data node. Please note that /hostPath is bind-mounted to /usr/share/elasticsearch and /nas is bind-mounted to /nas.
The moment I restart the docker-compose stack, the index shows a grey status bullet in Kibana.
When you configure Elasticsearch you configure a data path; this is the place where all of that node's data will be saved. You can configure it to be on your NAS if you want, but you cannot save specific data in different places: everything will be saved under the same data path.
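For reference, that setting is path.data; a minimal sketch (the path shown is the default inside the official Docker image):

```yaml
# elasticsearch.yml (sketch): one data path per node; everything the node stores lives under it
path.data: /usr/share/elasticsearch/data
```

In a Docker setup this container path is what gets bind-mounted to host storage, so pointing it at the NAS means changing the bind mount, not splitting data per index.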
If @NikoCosmico01 considers saving all indices to the NAS an acceptable compromise, that is indeed possible. But even that has risks: the NAS would need to be very snappy for this to be sensible and performant, each node needs its own directory on the NAS, and the NAS (and the network path to/from it) effectively becomes a single point of failure.
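For example, a docker-compose sketch (assuming the NAS is mounted on the host under /nas; the image tag is just an example, and cluster, discovery and security settings are omitted), with each node getting its own subdirectory:

```yaml
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.4
    volumes:
      - /nas/node01:/usr/share/elasticsearch/data   # one NAS directory per node
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.4
    volumes:
      - /nas/node02:/usr/share/elasticsearch/data
  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.4
    volumes:
      - /nas/node03:/usr/share/elasticsearch/data
```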
I often ask the "why" question back - it sometimes leads to a better idea, or even uncovers some misunderstanding of how stuff works.
So, why do you want that specific index on the NAS, but the rest not on the NAS?
My current data path points to SSDs with limited capacity, and the data on them needs to be highly available, both for fast retrieval and for data retention. At the same time I have read access to another SIEM in my org, not manageable by me due to bureaucracy, from which I get a ton of data that I want to use for ML training jobs. I do not want those logs to end up on my main SSDs, also because I do not need them to be highly available. I want to point out that this is a non-prod environment.
So your logic makes sense, and I don't see a simple alternative approach jumping out at me. You can add a new cluster, even just a single-node cluster whose data directory is on your NAS (all the caveats around using remote storage apply), where you can dump your data for later use / ML training.
But with your existing cluster and two such different requirements, there's no sensible way I can see to achieve both simultaneously.
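A rough docker-compose sketch of such a single-node cluster (service name, host port, image tag and NAS path are assumptions; security is disabled only because this is non-prod):

```yaml
services:
  es-archive:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.4
    environment:
      - discovery.type=single-node        # standalone one-node cluster
      - xpack.security.enabled=false      # non-prod shortcut; keep security on otherwise
    volumes:
      - /nas/es-archive:/usr/share/elasticsearch/data   # data directory on the NAS
    ports:
      - "9201:9200"                       # different host port than the main cluster
```

You could then point the Logstash output for that pipeline (or a reindex job) at this cluster instead of the main one.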
In the end I was able to achieve what I wanted by adding another node, binding its data directory to the NAS, and assigning this node the single role data_warm. Then I created an index lifecycle policy that moves all the data to the warm phase immediately (after 0 minutes) and assigned that policy to my desired index.
I am aware that I can no longer use the warm phase for other indices, but at the moment I am only using the hot and cold ones.
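For anyone finding this later, a sketch of that setup; the names, image tag and paths below are assumptions, not necessarily what was actually used. The extra node carries only the data_warm role and has its data directory bound to the NAS:

```yaml
# added under the existing "services:" block of docker-compose.yml
# (same cluster name, discovery and security settings as the other nodes, omitted here)
  es-warm:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.4
    environment:
      - node.roles=data_warm              # warm-tier data only; no master, no other roles
    volumes:
      - /nas/warm:/usr/share/elasticsearch/data   # warm data lives on the NAS
```

And the lifecycle policy, roughly, in Kibana Dev Tools (the index name here is hypothetical). A warm phase with min_age 0 and no explicit actions relies on the migrate action that ILM injects automatically, which relocates the index to the data_warm tier; note that ILM only evaluates phases at its poll interval, so "immediately" really means at the next check:

```
PUT _ilm/policy/nas-warm-policy
{
  "policy": {
    "phases": {
      "hot":  { "min_age": "0ms", "actions": {} },
      "warm": { "min_age": "0ms", "actions": {} }
    }
  }
}

PUT ml-training-logs/_settings
{
  "index.lifecycle.name": "nas-warm-policy"
}
```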