I want to understand whether, when data is written to the replica shards, it is copied from the primary shard itself or supplied again by the master node. I understand that Elasticsearch may first wait for the operation on the primary shard to complete successfully before moving on to the replica shards, but does the data itself come from the primary shard or from the master node?
The current master node does not participate in indexing or query flows; it is only there to manage the cluster state. Indexing requests are sent to the node holding the appropriate primary shard, where they are processed before the data is replicated to the replica shards.
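To make the "sent to the node holding the primary shard" step concrete, here is a minimal sketch of how a write could be routed. It is a toy model under stated assumptions, not Elasticsearch code: `RoutingSketch`, `shardFor`, and `primaryNodeFor` are hypothetical names, and `String.hashCode()` stands in for the Murmur3-based routing hash that Elasticsearch actually uses. The point it illustrates is that the node receiving the request works out the target shard itself and forwards the write to that shard's primary; the master only maintains the cluster state that says where each primary lives.

```java
import java.util.List;

public class RoutingSketch {
    // Hypothetical stand-in for the routing hash; the real implementation is Murmur3-based.
    static int shardFor(String routing, int numPrimaryShards) {
        // floorMod keeps the result in [0, numPrimaryShards) even for negative hash codes.
        return Math.floorMod(routing.hashCode(), numPrimaryShards);
    }

    // Hypothetical lookup: the list index is the shard number, the value is the node
    // that currently holds the primary copy of that shard (taken from the cluster state).
    static String primaryNodeFor(String docId, List<String> primaryNodeByShard) {
        return primaryNodeByShard.get(shardFor(docId, primaryNodeByShard.size()));
    }

    public static void main(String[] args) {
        List<String> primaryNodeByShard = List.of("node-a", "node-b", "node-c");
        String docId = "user-42";
        System.out.println("doc " + docId + " -> shard " + shardFor(docId, 3)
                + " on " + primaryNodeFor(docId, primaryNodeByShard));
    }
}
```

Note that no call in this sketch goes through a master node: routing is computed locally from the document's routing value and the known shard layout.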
Thank you for the swift reply.
Any chance you can point me to the classes responsible for copying newly written data from the primary shard to the replica shards?
I am trying to understand whether, if the node holding the primary shard is malicious (intentionally writing false data), there is any way to prevent the wrong data from being written to the replica shards as well.
Suppose Elasticsearch is installed on various servers, not all of which are mine. The owner of a server that hosts one of the slave (data) nodes could run a modified "bad" version of that node. When the node is chosen to hold the primary shard for some data, the owner could decide to corrupt that data, for whatever reason, by changing some detail. If this happens on the initial write to the primary shard, the corrupted data will also propagate to the replica shards, since it comes from the primary shard rather than being sent by the master node to the primary and replica shards together.
Can you please point me to the area in the code where the data is passed from the primary shard to the replica shards?
These days there are more community-based projects that may need to store data reliably and retrieve it safely in a distributed system running on their members' machines. For this to work, certain aspects of potential maliciousness need to be addressed. This is why I was asking about Elasticsearch.
Are there any plans to add support for this in future versions?
I will check the packages that you mentioned (thank you again)
I was looking at "src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java" and I see that "execute()" runs
"primary.perform(request)" and, if that succeeds, "performOnReplicas(replicaRequest,globalCheckpoint,maxSeqNoOfUpdatesOrDeletes,replicationGroup);"
Are both of these actions invoked from the node holding the primary shard, and not from the master node?
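Assuming the reading of "execute()" above is right, the ordering and its consequence for a malicious primary can be sketched as a toy model. This is not the real ReplicationOperation: `ReplicationOrderSketch`, `ShardCopy`, and `primaryTransform` are hypothetical names introduced only for illustration. The sketch shows that the replicas receive whatever the primary produced, so a primary that alters the request passes the altered data on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class ReplicationOrderSketch {
    // Hypothetical shard copy that simply stores the documents it is given.
    static class ShardCopy {
        final List<String> docs = new ArrayList<>();
    }

    // Toy version of the flow: the primary handles the write first, then forwards
    // its own result to every replica. 'primaryTransform' models whatever the
    // primary node does to the request before storing it.
    static void execute(String request, UnaryOperator<String> primaryTransform,
                        ShardCopy primary, List<ShardCopy> replicas) {
        // Analogous to primary.perform(request): the result becomes the replica request.
        String replicaRequest = primaryTransform.apply(request);
        primary.docs.add(replicaRequest);
        // Analogous to performOnReplicas(...): replicas see only the primary's output.
        for (ShardCopy replica : replicas) {
            replica.docs.add(replicaRequest);
        }
    }

    public static void main(String[] args) {
        ShardCopy primary = new ShardCopy();
        List<ShardCopy> replicas = List.of(new ShardCopy(), new ShardCopy());
        // An honest primary would pass the request through unchanged; this one tampers with it.
        execute("balance=100", doc -> doc.replace("100", "0"), primary, replicas);
        System.out.println(replicas.get(0).docs); // prints [balance=0]
    }
}
```

In this model the replicas never see the original request, which is why they have nothing to compare the primary's output against.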