Ansible Failure Recovery & Monitoring

The KumoScale software cross-domain resiliency solution addresses several failure scenarios through a self-healing process that the initiator executes periodically. In addition, corrective measures can be taken when planning a maintenance procedure, or when a permanent replica loss is detected, by adding a new leg and removing the faulty one.

The Process of Self-Healing

The self-healing process queries the KumoScale Provisioner Service and performs the following tasks:

  1. If an application initiator was previously connected to a target, the process reconnects it.
  2. The replica state in the KumoScale Provisioner service is compared to its state in the multiple device driver (md). If discrepancies are detected, the self-healing process applies corrective measures. A replica marked as deleted is removed from the md. This handles scenarios where the replica was deleted, for instance, when a storage node goes down.
  3. If a replica has been detected as missing
    • for a time period greater than the value specified by maxReplicaDowntime, and
    • the volume has fewer than four replicas (the maximum),
     then the self-healing process allocates a new replica in the most appropriate location, according to the volume’s storage class parameters. It also connects to the new replica and synchronizes it. The missing replica is detected and removed on the next self-healing iteration.

Only a single repair is executed on each self-healing iteration.
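
The selection of a single repair per iteration can be illustrated with a minimal Python sketch. It is not the KumoScale implementation: the Replica fields, the pick_repair function, and the order in which the checks run are assumptions made for illustration, and max_replica_downtime is assumed to use the same time unit as missing_for.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_REPLICAS = 4  # maximum number of replicas per volume, as stated above


@dataclass
class Replica:
    # Hypothetical view of one replica, combining Provisioner and md state.
    name: str
    marked_deleted: bool = False  # the Provisioner service marks it as deleted
    in_md: bool = True            # it is still a member of the md array
    missing_for: float = 0.0      # how long it has been missing (0 if present)


def pick_repair(replicas: List[Replica],
                max_replica_downtime: float) -> Optional[str]:
    """Return a description of the one repair for this iteration, or None."""
    # Task 2: a replica marked as deleted is removed from the md array.
    for r in replicas:
        if r.marked_deleted and r.in_md:
            return f"remove {r.name} from md"
    # Task 3: replace a replica that has been missing for too long, but only
    # while the volume has fewer than the maximum number of replicas.
    for r in replicas:
        if (not r.marked_deleted
                and r.missing_for > max_replica_downtime
                and len(replicas) < MAX_REPLICAS):
            return f"allocate replacement for {r.name}"
    return None  # nothing to repair in this iteration
```

Because the function returns as soon as it finds one repair, a volume with several problems is healed over several iterations, which matches the single-repair rule.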

Planned Maintenance

The KumoScale resiliency solution supports scheduled maintenance operations by providing APIs for adding a replica to, and removing a replica from, a resilient volume. This allows you to remove the replica from the KumoScale device that is being shut down and create it on a device in a different rack. When the maintenance operation completes, the replica can be returned to its original location in the same manner.

Adding and removing a replica is done via the ks_replica playbook.

Monitoring

Cross-domain resiliency uses the mdadm monitoring and reporting mechanism. The KumoScale Provisioner service forwards events and commands from the initiators and from the Ansible modules and playbooks to the Syslog server. The mdadm utility periodically polls the md arrays and reports any detected events to a configured Syslog server (rsyslog).
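
To make the polled state concrete, the following Python sketch scans text in the /proc/mdstat format and lists the arrays whose active member count is below the configured count. It only illustrates the kind of information mdadm monitors; it is not how mdadm is implemented, and the sample device names are made up.

```python
import re
from typing import Dict

# An array header line starts with the md device name, e.g. "md1 : active ...".
_ARRAY = re.compile(r"^(md\S*)\s*:")
# The status line carries a "[configured/active]" member counter, e.g. "[2/1]".
_COUNTS = re.compile(r"\[(\d+)/(\d+)\]")

# Hypothetical sample in /proc/mdstat format: md1 is healthy, md2 is degraded.
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      1048576 blocks super 1.2 [2/2] [UU]

md2 : active raid1 nvme3n1[0]
      1048576 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> Dict[str, str]:
    """Map each degraded array name to its 'active/configured' member count."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = _ARRAY.match(line)
        if header:
            current = header.group(1)
            continue
        counts = _COUNTS.search(line)
        if current and counts:
            configured, active = int(counts.group(1)), int(counts.group(2))
            if active < configured:
                result[current] = f"{active}/{configured}"
            current = None  # only the first counter after a header is used
    return result
```

On a live initiator the same function could be given the contents of /proc/mdstat instead of SAMPLE.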

Configuration and Activation

  1. Ensure that mdadm monitoring is activated by the configure_mdadm_syslog
  2. Ensure that initiator commands and events monitoring is activated when a Syslog server is configured for storage node use.
  3. Configure the Syslog parameters in the vars.yml file:

Parameter      Description                           Example
syslog_server  The IP address of the Syslog server.  syslog_server: 192.0.2.0
syslog_port    The port of the Syslog server.        syslog_port: 6514
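
Before running the playbooks, the two parameters can be sanity-checked. The sketch below assumes the values from vars.yml have already been loaded into a Python dict; validate_syslog_vars is a hypothetical helper and is not part of the KumoScale Ansible package.

```python
import ipaddress
from typing import List


def validate_syslog_vars(variables: dict) -> List[str]:
    """Return a list of problems found in the Syslog parameters (empty if none)."""
    errors = []

    # syslog_server must be a valid IPv4 or IPv6 address.
    server = variables.get("syslog_server")
    try:
        ipaddress.ip_address(str(server))
    except ValueError:
        errors.append(f"syslog_server is not a valid IP address: {server!r}")

    # syslog_port must be an integer in the valid TCP/UDP port range.
    port = variables.get("syslog_port")
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        errors.append(f"syslog_port is not a valid port number: {port!r}")

    return errors
```

For example, validate_syslog_vars({"syslog_server": "192.0.2.0", "syslog_port": 6514}) returns an empty list.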

Event Notifications

The mdadm utility monitors the arrays and generates events that can be sent to a Syslog server. Each event is sent with the following parameters:

  • The name of the event as shown in the following table.
  • The affected md device name
  • A related device, if it exists (e.g., a component device that has failed)

These events are categorized by the level of severity: critical, warning, and info.

Table 9. Linux OS Events Reported by the Cross-Domain Resiliency Solution

Event            Description                                                                                                       Severity
DegradedArray    A replica has disconnected (this is not generated when mdadm notices a drive failure).                            Critical
Fail             An active component device of an array was marked faulty.                                                         Critical
RebuildNN        The progress of a replica rebuild process, as a percentage (NN is a zero-padded, two-digit number, e.g. 05, 48).  Warning
RebuildFinished  The reconstruction of a replica finished (successfully or aborted).                                               Warning
NewArray         A new md array has been detected in the /proc/mdstat file.                                                        Info
TestMessage      An array was detected at boot and the --test flag was specified.                                                  Info

NOTE: Each event has an associated array device (e.g. /dev/md1) and possibly an additional device. For Fail, the second device is the relevant component device. Refer to mdadm documentation for additional information regarding the various event states.
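
A log consumer on the Syslog side might map the event names in Table 9 to their severity. The following Python sketch shows one way to do this, including the RebuildNN pattern; the classify function and the lowercase severity strings are illustrative choices, not part of mdadm or KumoScale software.

```python
import re
from typing import Optional, Tuple

# Severity per event name, taken from Table 9.
SEVERITY = {
    "DegradedArray": "critical",
    "Fail": "critical",
    "RebuildFinished": "warning",
    "NewArray": "info",
    "TestMessage": "info",
}

# RebuildNN carries the rebuild progress as a two-digit percentage.
_REBUILD = re.compile(r"^Rebuild(\d{2})$")


def classify(event: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (severity, rebuild_percent) for an mdadm event name.

    rebuild_percent is set only for RebuildNN events; severity is None
    for event names that are not listed in Table 9.
    """
    progress = _REBUILD.match(event)
    if progress:
        return "warning", int(progress.group(1))
    return SEVERITY.get(event), None
```

Event names outside the table are returned with a severity of None so that the caller can decide how to handle them.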


The following initiator events are forwarded to the Syslog server if it is configured in KumoScale software:

Table 10. Initiator Events Reported by KumoScale Software to the Syslog server

Event                Description                               Parameters                                          Severity
Session Established  An initiator connected to a target.       The initiator's (host's) NQN and the target’s NQN.  Info
Session closed       An initiator disconnected from a target.  The initiator's (host's) NQN and the target’s NQN.  Info
