We are due to deploy Elasticsearch (along with Graylog) to our offices in London, New York, Seattle & San Francisco.
We have a server in each remote office (Graylog & Elasticsearch). However, I have a small question/problem.
We will be logging quite a lot of information. What I would like to do is:
Main Office: London
Store all information in the Elasticsearch cluster here.
New York (Main US Office):
Store all North American office logs in the Elasticsearch database here (NY, Seattle & San Francisco logs will be saved in this Elasticsearch DB).
(Seattle and San Francisco will store logs ONLY from their own office, so network devices and syslog messages in each office will be stored on that office's own Elasticsearch database only.)
I don't want Seattle & San Francisco to hold the whole Elasticsearch DB, as those offices are small and only about 10 devices will be logging to ES/Graylog, whereas NYC and the UK will each have > 100 devices/servers logging to them.
I want the UK and NYC to hold ALL the information in the ES cluster, and Seattle and San Francisco to hold only their own logs but also send them to NYC so we have a backup. Is that possible?
I've seen shard allocation filtering but I'm unsure how to go ahead with it. Sorry if this is confusing; it's been a project for > 6 months and ideally we want to roll it out in the next 30-40 days.
Thank you for all your help,
Michel
Junior Sys Admin
If you are happy shipping things over the wire, then I'd have a cluster per continent, with indices split per site and per source (i.e. network, system, etc.).
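If you do keep the North American offices in one cluster, the shard allocation filtering you mentioned mostly comes down to tagging each node with a custom attribute and pinning each per-site index to it. A rough sketch in Python against the REST API (the endpoint, the "site" attribute and the index names are placeholders, not anything from your setup):

```python
# Sketch only: assumes each node's elasticsearch.yml carries a custom attribute
# such as `node.attr.site: seattle` (plain `node.site:` on older 1.x/2.x
# releases) and that per-site indices follow a naming scheme like
# graylog-seattle-*.
import requests

ES = "http://localhost:9200"  # assumed cluster endpoint


def pin_index_to_site(index, site):
    """Require every shard of `index` to live on nodes tagged with `site`."""
    r = requests.put(
        f"{ES}/{index}/_settings",
        json={"index.routing.allocation.require.site": site},
    )
    r.raise_for_status()
    return r.json()


# e.g. keep Seattle's logs on Seattle nodes only
pin_index_to_site("graylog-seattle-2016.01", "seattle")
```

Keep in mind this only controls where shards sit inside a single cluster; it doesn't give you a second copy in NYC by itself, which is why per-continent clusters plus shipping the data separately tends to be the simpler mental model.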
I'm looking for similar functionality, but for disaster recovery. We have two datacenters, one that houses our DR environment and another that houses our production. Our production environment runs robust servers (SSDs, etc.) while our DR environment would be more spartan, with VMs mounting NFS shares (definitely not ideal, but it would allow us to operate our business).
Based on what I've been able to piece together, we'd want a separate "disaster recovery" cluster set up at our DR center that operates completely independently of production, and we'd want to sync data between them somehow.
Is there a way this can be automated on some kind of schedule? Is that something we'd have to do on our own with cron jobs, or is there support for this sort of thing natively?
You could feed the same data to multiple clusters (see Kafka, Kafka MirrorMaker) or maybe you could use ES snapshots if not being 100% up to date in secondary DCs is acceptable.
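For the snapshot route, registering a shared-filesystem repository and taking a snapshot is just two REST calls. A hedged sketch in Python (the repository name, mount path, snapshot name and index pattern are made up; the path must be mounted on every node and whitelisted via `path.repo` in elasticsearch.yml):

```python
import requests

ES = "http://localhost:9200"  # assumed production endpoint
REPO = "dr_backup"            # hypothetical repository name

# Register a shared-filesystem snapshot repository.
requests.put(
    f"{ES}/_snapshot/{REPO}",
    json={"type": "fs", "settings": {"location": "/mnt/es-backups", "compress": True}},
).raise_for_status()

# Snapshot the indices you care about and block until the snapshot finishes.
requests.put(
    f"{ES}/_snapshot/{REPO}/snapshot_2016_01_01",
    params={"wait_for_completion": "true"},
    json={"indices": "graylog-*", "ignore_unavailable": True},
).raise_for_status()
```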
Yeah, I've been looking around, but there doesn't seem to be an "official" way to do it. Most answers point to taking a snapshot from prod and restoring it in your "mirrored" environment, or, if you need it live, doing "distributed indexing" or something, where whenever you index into one cluster you distribute that indexing request to multiple clusters.
We'll probably end up doing something where we regularly snapshot the prod index to a network file share, then periodically restore a full snapshot to DR just to keep it "close" to production. In our case we don't need the DR index to be always 100% caught up with production, just relatively close.
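For what it's worth, the "periodically restore to DR" half can be a small cron job as well. A sketch, assuming the DR cluster sees the same file share registered (ideally read-only) under the same hypothetical repository name, and that it's acceptable to drop the stale DR copies before each restore; none of these hostnames or index patterns come from the thread:

```python
#!/usr/bin/env python
"""Cron-driven restore of the latest snapshot into a DR cluster (sketch)."""
import requests

DR_ES = "http://dr-es.example.com:9200"  # assumed DR cluster endpoint
REPO = "dr_backup"                       # same repo, registered read-only on DR

# Find the most recent snapshot in the repository.
snaps = requests.get(f"{DR_ES}/_snapshot/{REPO}/_all").json()["snapshots"]
latest = sorted(snaps, key=lambda s: s["start_time_in_millis"])[-1]["snapshot"]

# A restore won't touch an open index, so drop the stale DR copies first.
requests.delete(f"{DR_ES}/graylog-*")

# Restore the snapshot and wait for it to complete.
requests.post(
    f"{DR_ES}/_snapshot/{REPO}/{latest}/_restore",
    params={"wait_for_completion": "true"},
    json={"indices": "graylog-*", "ignore_unavailable": True},
).raise_for_status()
```

Run from cron at whatever interval keeps DR "close enough" for you; the repeated full restore is crude but matches the "relatively close, not 100% caught up" requirement.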