Failure Recovery Process

The cross-domain resiliency solution addresses several failure scenarios (see Failure Recovery Scenarios) by having the host periodically execute the self-healing process. In addition, corrective measures can be taken when planning a maintenance procedure, or when a permanent replica loss is detected, by adding a new leg and removing the faulty one.

5.1 The Process of Self-Healing

The self-healing process queries the KumoScale Provisioner (KumoScale appliances) and performs the following tasks:

  1. If an application host was connected to a target and the connection was lost, the process reconnects it.
  2. The replica state in the KumoScale Provisioner is compared to its state in the MD. If discrepancies are detected, the self-healing process applies corrective measures: a replica marked as deleted is removed from the MD. This handles scenarios where the replica was deleted, for instance, when a backend goes down.
  3. If a replica has been missing for longer than maxReplicaDowntime and the volume has fewer than four replicas (the maximum), the self-healing process allocates a new replica in the most appropriate location, according to the volume’s storage class parameters. It then connects to the new replica and synchronizes it.

The missing replica itself is detected and removed on the next self-healing iteration.

Only a single repair is executed on each self-healing iteration.
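
The tasks above can be sketched as a single function that runs one iteration and stops after the first repair. This is a minimal illustration, not the KumoScale implementation: the Replica and Volume classes, the allocate callback, and the choice to flag the missing replica as deleted once its replacement is added are all assumptions made for the example. It also assumes that the tasks are tried in the listed order and that each of them counts as a repair.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

MAX_REPLICAS = 4  # maximum replicas per volume, as stated in the text


@dataclass
class Replica:
    name: str
    connected: bool = True
    deleted: bool = False         # marked as deleted in the Provisioner
    missing_seconds: float = 0.0  # 0 means the replica is reachable


@dataclass
class Volume:
    replicas: Dict[str, Replica]                         # hypothetical Provisioner view
    md_members: List[str] = field(default_factory=list)  # replicas present in the MD


def self_healing_iteration(
    vol: Volume,
    max_replica_downtime: float,
    allocate: Callable[[Volume], Replica],
) -> Optional[str]:
    """Run one iteration and return a description of the single repair, if any."""
    # Task 1: reconnect a reachable replica whose connection was lost.
    for rep in vol.replicas.values():
        if not rep.connected and not rep.deleted and rep.missing_seconds == 0:
            rep.connected = True
            return f"reconnected {rep.name}"

    # Task 2: drop replicas that are marked as deleted from the MD.
    for name in list(vol.md_members):
        rep = vol.replicas.get(name)
        if rep is None or rep.deleted:
            vol.md_members.remove(name)
            vol.replicas.pop(name, None)
            return f"removed {name}"

    # Task 3: replace a replica that has been missing for too long.
    if len(vol.replicas) < MAX_REPLICAS:
        for rep in list(vol.replicas.values()):
            if rep.missing_seconds > max_replica_downtime and not rep.deleted:
                new = allocate(vol)  # placement policy is supplied by the caller
                vol.replicas[new.name] = new
                vol.md_members.append(new.name)
                # Assumption: flag the missing replica so that the next
                # iteration removes it from the MD.
                rep.deleted = True
                return f"added {new.name}"

    return None  # nothing to repair in this iteration
```

With a two-replica volume in which one replica has been missing for longer than the threshold, the first call adds a replacement, the second call removes the missing replica, and the third call finds nothing to repair.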

5.2 Planned Maintenance

KumoScale’s resiliency solution supports scheduled maintenance operations by providing APIs for adding a replica to, and removing a replica from, a resilient volume. This makes it possible to remove the replica from the KumoScale appliance that is being shut down and create it on an appliance in a different rack. When the maintenance operation completes, the replica can be returned to its original location in the same manner.
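
The relocation described above can be illustrated with a small in-memory model. The FakeResilientVolumeApi class and its add_replica and remove_replica methods are hypothetical stand-ins; the actual KumoScale API names and parameters are not given in this section. The sketch adds the new replica before removing the old one so that the replica count never drops during the move, which is an assumed ordering rather than a documented requirement.

```python
from typing import Dict, Set


class FakeResilientVolumeApi:
    """In-memory stand-in for the add/remove replica APIs (hypothetical names)."""

    def __init__(self, placement: Dict[str, Set[str]]) -> None:
        # Maps a volume name to the set of appliances that hold its replicas.
        self.placement = placement

    def add_replica(self, volume: str, appliance: str) -> None:
        self.placement[volume].add(appliance)

    def remove_replica(self, volume: str, appliance: str) -> None:
        self.placement[volume].discard(appliance)


def relocate_replica(
    api: FakeResilientVolumeApi, volume: str, source: str, destination: str
) -> None:
    """Move one replica of a volume from source to destination."""
    current = api.placement[volume]
    if source not in current:
        raise ValueError(f"{volume} has no replica on {source}")
    if destination in current:
        raise ValueError(f"{volume} already has a replica on {destination}")
    api.add_replica(volume, destination)  # create the replica on the other rack
    api.remove_replica(volume, source)    # then drop it from the appliance under maintenance


# Before maintenance: move the replica off the appliance that will be shut down.
api = FakeResilientVolumeApi({"vol1": {"rack1-app1", "rack2-app1"}})
relocate_replica(api, "vol1", source="rack1-app1", destination="rack3-app1")

# After maintenance: return the replica to its original location the same way.
relocate_replica(api, "vol1", source="rack3-app1", destination="rack1-app1")
```

A volume that already holds the maximum number of replicas would need the removal to happen first; the sketch does not handle that case.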