Ansible Failure Recovery & Monitoring

The KumoScale software cross-domain resiliency solution addresses several failure scenarios by having the initiator periodically execute a self-healing process. In addition, when a maintenance procedure is planned or a permanent replica loss is detected, corrective measures can be taken by adding a new leg and removing the faulty one.

The Process of Self-Healing

The self-healing process queries the KumoScale Provisioner Service and performs the following tasks:

  1. If an application initiator was previously connected to a target, the process reconnects it.
  2. The replica state in the KumoScale Provisioner service is compared to its state in the multiple device driver (md). If discrepancies are detected, the self-healing process applies corrective measures. A replica marked as deleted will be removed from the md. This handles scenarios where the replica was deleted, for instance, when a storage node goes down.
  3. If a replica has been detected as missing
    • for a time period greater than the value specified by maxReplicaDowntime, and
    • the volume has fewer than four replicas (the maximum),
     the self-healing process allocates a new replica in the most appropriate location, according to the volume’s storage class parameters. It then connects to the new replica and synchronizes it. The missing replica is detected and removed on the next self-healing iteration.

Only a single repair is executed on each self-healing iteration.
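
The decision logic of one self-healing iteration can be sketched as follows. This is an illustrative model only, not KumoScale code: the Replica type, its field names, the action labels, and plan_repair are hypothetical, and max_replica_downtime stands in for the maxReplicaDowntime setting. The sketch also assumes that a missing replica still counts toward the volume's replica total and that removal of deleted replicas is checked first. It returns at most one action per call, mirroring the single-repair rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_REPLICAS = 4  # maximum number of replicas per volume, per the text above


@dataclass
class Replica:
    name: str                      # hypothetical identifier
    deleted: bool = False          # marked as deleted in the Provisioner service
    missing_seconds: float = 0.0   # how long the replica has been missing (0 = present)


def plan_repair(replicas: List[Replica],
                max_replica_downtime: float) -> Optional[Tuple[str, str]]:
    """Return at most one (action, replica_name) pair for this iteration."""
    # A replica marked as deleted is removed from the md array.
    for r in replicas:
        if r.deleted:
            return ("remove_from_md", r.name)
    # A replica missing longer than the allowed downtime is replaced,
    # but only while the volume is below the replica maximum.
    if len(replicas) < MAX_REPLICAS:
        for r in replicas:
            if r.missing_seconds > max_replica_downtime:
                return ("allocate_replacement_for", r.name)
    return None  # nothing to repair on this iteration
```

Calling plan_repair repeatedly, once per iteration, models how a volume with several problems converges one repair at a time.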

Planned Maintenance

The KumoScale resiliency solution supports scheduled maintenance operations by providing APIs for adding a replica to, and removing a replica from, a resilient volume. This allows you to remove the replica from the KumoScale device that is being shut down and create it on a device in a different rack. When the maintenance operation completes, the replica can be returned to its original location in the same manner.

Replicas are added and removed using the ks_replica playbook.


Cross-domain resiliency uses the mdadm monitoring and reporting mechanism. The KumoScale Provisioner service forwards events and commands from the initiators and from the Ansible modules and playbooks to the Syslog server. The mdadm utility periodically polls the md arrays and reports any detected events to a configured Syslog server (rsyslog).

Configuration and Activation

  1. Ensure that mdadm monitoring is activated by the configure_mdadm_syslog
  2. Ensure that initiator commands and events monitoring is activated when a Syslog server is configured for storage node use.
  3. Configure the Syslog parameters in the vars.yml file:
    • The IP address of the Syslog server.
    • The port of the Syslog server, for example: syslog_port: 6514
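
Before running the playbooks, the two Syslog values can be sanity-checked with a few lines of Python. This is a minimal sketch rather than part of the KumoScale tooling: validate_syslog_settings and its argument names are hypothetical, and the address 192.0.2.10 in the usage note is a documentation placeholder.

```python
import ipaddress


def validate_syslog_settings(server_ip: str, port: int) -> None:
    """Raise ValueError if the Syslog server address or port is unusable."""
    # ip_address() raises ValueError for a malformed IPv4 or IPv6 address.
    ipaddress.ip_address(server_ip)
    # Valid TCP/UDP ports lie in the range 1..65535.
    if not 1 <= port <= 65535:
        raise ValueError(f"syslog port out of range: {port}")
```

For example, validate_syslog_settings("192.0.2.10", 6514) returns without error, while a port of 70000 or a malformed address raises ValueError.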

Event Notifications

The mdadm utility monitors the arrays and generates events that can be sent to a Syslog server. Each event is sent with the following parameters:

  • The name of the event, as shown in the following table.
  • The name of the affected md device.
  • A related device, if one exists (e.g., a component device that has failed).

These events are categorized by the level of severity: critical, warning, and info.

Table 9. Linux OS Events Reported by the Cross-Domain Resiliency Solution

  • A replica has disconnected (this event is not generated when mdadm notices a drive failure).
  • An active component device of an array was marked faulty.
  • The progress of a replica rebuild process, as a percentage (NN is a zero-padded, two-digit number, e.g. 05, 48).
  • The reconstruction of a replica finished (successfully or aborted).
  • A new md array has been detected in the /proc/mdstat file.
  • An array was detected at boot and the --test flag was specified.


NOTE: Each event has an associated array device (e.g. /dev/md1) and possibly an additional device. For Fail, the second device is the relevant component device. Refer to mdadm documentation for additional information regarding the various event states.
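
A consumer of these messages could classify event names along the following lines. This is a hedged sketch: the event names and severities below are the ones listed in the mdadm(8) manual page for monitor mode, not values taken from Table 9, and should be verified against the installed mdadm version. The classify function and the "unknown" fallback are hypothetical.

```python
import re
from typing import Optional, Tuple

# Syslog priorities as listed in the mdadm(8) manual page (assumption:
# the deployed mdadm version uses the same names and priorities).
SEVERITY = {
    "Fail": "critical",
    "DegradedArray": "critical",
    "RebuildFinished": "warning",
    "NewArray": "info",
    "TestMessage": "info",
}


def classify(event: str) -> Tuple[str, Optional[int]]:
    """Return (severity, rebuild_percent) for an mdadm event name.

    rebuild_percent is None unless the event has the form RebuildNN.
    """
    match = re.fullmatch(r"Rebuild(\d{2})", event)
    if match:
        # RebuildNN carries the rebuild progress as a two-digit percentage.
        return ("warning", int(match.group(1)))
    return (SEVERITY.get(event, "unknown"), None)
```

Treating RebuildNN separately keeps the lookup table small, since the numeric suffix changes as the rebuild progresses.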


The following initiator events are forwarded to the Syslog server if it is configured in KumoScale software:

Table 10. Initiator Events Reported by KumoScale Software to the Syslog server

  • Session Established: An initiator connected to a target. Parameters: the initiator’s (host’s) NQN and the target’s NQN.
  • Session closed: An initiator disconnected from a target. Parameters: the initiator’s (host’s) NQN and the target’s NQN.
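
Because both events carry the same pair of NQNs, a log consumer can replay them to find out which sessions are currently open. The sketch below assumes the events have already been parsed into (event, host NQN, target NQN) tuples. The on-the-wire Syslog message format is not specified here, and open_sessions is a hypothetical helper.

```python
from typing import Iterable, Set, Tuple


def open_sessions(events: Iterable[Tuple[str, str, str]]) -> Set[Tuple[str, str]]:
    """Replay (event, host_nqn, target_nqn) tuples in order and return
    the (host_nqn, target_nqn) pairs whose session is still open."""
    sessions: Set[Tuple[str, str]] = set()
    for event, host_nqn, target_nqn in events:
        if event == "Session Established":
            sessions.add((host_nqn, target_nqn))
        elif event == "Session closed":
            # discard() ignores a close for a session that was never seen.
            sessions.discard((host_nqn, target_nqn))
    return sessions
```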


