4 ES instances, one node each, running in the same box but using different cluster names.
So they won't be working together, but they will all store the same type of data.
A proxy running on top of these instances to route my indices to the right node.
ZFS snapshots as backup
I am not completely aware of ES's capabilities; the main reason here would be to easily move an entire index to new servers as it grows.
I did not know about this, but I have one question.
With this approach, would I not have to declare my indices in the file?
Because my indices are generated on the fly, I do not know their names before their creation.
I must admit that I also do not understand exactly what you are trying to achieve. The fact that you have 6000 indices in a cluster of that size, and expect that number to grow, is however a concern. Each shard in Elasticsearch is a Lucene index and has some resource overhead (memory, file handles, CPU) associated with it. Having that many indices and shards will unnecessarily use up a lot of system resources and is likely not to scale well. Why do you have so many indices?
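If you want a quick sense of how much you are already carrying, the _cat APIs will show the index and shard counts directly. A rough sketch in Python, assuming the cluster is reachable on localhost:9200:

```python
# Count indices and shards via the _cat APIs.
# Assumes Elasticsearch is reachable on localhost:9200 (adjust as needed).
import requests

ES = "http://localhost:9200"

# _cat/indices lists every index; _cat/shards lists every shard,
# and each shard is a full Lucene index with its own overhead.
indices = requests.get(f"{ES}/_cat/indices?h=index").text.splitlines()
shards = requests.get(f"{ES}/_cat/shards?h=index,shard,prirep").text.splitlines()

print(f"{len(indices)} indices, {len(shards)} shards (primaries + replicas)")
```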
Each time the same client archives the same website, I get a different snapshot id.
The document content is the website's html/css/js/etc.
The documents are stored following this path: /website_id/snapshot_id/document_id
Each new index is dedicated to the snapshots from a particular website.
Each website has its own index.
That's why so many indices.
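To make the layout concrete, this is roughly how a document gets indexed today (Python against the REST API; the ids and the localhost address below are just placeholders, and the exact document URL depends on the Elasticsearch version):

```python
# Rough sketch of the current one-index-per-website layout.
# website_42, snapshot_2016_01 and index_html are placeholder ids.
import requests

ES = "http://localhost:9200"      # assumed cluster address
website_id = "website_42"         # the index name is the website id
snapshot_id = "snapshot_2016_01"  # one archiving run of that website
document_id = "index_html"        # one archived file within that snapshot

doc = {
    "snapshot_id": snapshot_id,
    "path": "/index.html",
    "content": "<html>...</html>",  # the archived html/css/js payload
}

# /website_id/snapshot_id/document_id collapses into index name + document id here.
requests.put(f"{ES}/{website_id}/_doc/{snapshot_id}_{document_id}", json=doc)
```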
Now, about the standalone instances: the idea was proposed in order to make it easier to migrate data to another server as it grows. For instance, we could move an entire index's data to another dedicated server. Since the data would be stored on only one node, we could easily move the entire index.
I am trying to understand whether this is viable. That is why I would like to hear from you guys and find out if there are other ways to achieve this.
Having a separate index per website wastes a lot of resources and will not scale. If the structure of the documents you are indexing for the different websites is similar and/or you can control the mappings, I would recommend storing multiple websites, if not all of them, in a single index. You can then either add filters at the application layer or use filtered aliases when accessing the data. If you always query per website, you can also use routing to ensure that all documents belonging to a single website reside in a single shard, which can improve query latency and throughput.
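To illustrate, here is a minimal sketch of that approach, assuming a shared index called `websites`, a `website_id` field mapped as a keyword (not analyzed), and a reasonably recent Elasticsearch version. All the names are placeholders, not something prescribed:

```python
# Single shared index with a filtered alias and routing per website.
import requests

ES = "http://localhost:9200"

# 1. Create a filtered alias per website. Queries through "website_42" only see
#    that website's documents, and the routing value keeps them on a single shard.
requests.post(f"{ES}/_aliases", json={
    "actions": [{
        "add": {
            "index": "websites",
            "alias": "website_42",
            "filter": {"term": {"website_id": "website_42"}},
            "routing": "website_42",
        }
    }]
})

# 2. Index through the alias; the alias supplies the routing value automatically.
requests.put(f"{ES}/website_42/_doc/snapshot_2016_01_index_html", json={
    "website_id": "website_42",
    "snapshot_id": "snapshot_2016_01",
    "path": "/index.html",
    "content": "<html>...</html>",
})

# 3. Search through the alias; the filter and routing are applied for you.
resp = requests.post(f"{ES}/website_42/_search", json={
    "query": {"match": {"content": "archive"}},
})
print(resp.json()["hits"]["hits"])
```

The filter and routing live on the alias, so the application only ever has to know the website id; nothing per-website needs to be declared up front beyond creating the alias when the website is first seen.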