KumoScale™ software supports resiliency using Cross-Domain Data Replication (CDDR). This capability is based on data replication and enables users to flexibly specify the number of replicas on a per-storage-class or per-application basis. In CDDR, write operations are processed synchronously and in parallel (i.e., each write is forwarded to every replica), and a write is acknowledged back to the application only once it has been committed to non-volatile media at each replica.
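The write path described above can be illustrated with a minimal sketch. It is not KumoScale code: the `Replica` class and `replicated_write` function are hypothetical names, the replicas are in-memory stand-ins, and the handling of a failed replica (here, simply withholding the acknowledgment) is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Replica:
    """Hypothetical in-memory stand-in for one volume replica."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.blocks = {}

    def commit(self, lba, data):
        # A real replica would persist to non-volatile media before returning.
        if not self.healthy:
            raise IOError(f"replica {self.name} is unreachable")
        self.blocks[lba] = data
        return True


def replicated_write(replicas, lba, data):
    """Forward the write to every replica in parallel.

    Returns True (the acknowledgment) only if every replica committed.
    """
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r.commit, lba, data) for r in replicas]
        results = []
        for future in futures:
            try:
                results.append(future.result())
            except IOError:
                results.append(False)
    return all(results)
```

For example, `replicated_write([Replica("a"), Replica("b"), Replica("c")], 0, b"payload")` returns `True` only after all three stand-in replicas have stored the block.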
This approach supports provisioning resilient volumes across multiple failure domains, protecting against events such as networking failures, Top-of-Rack (ToR) switch downtime, rack power failures, and KumoScale software failures. It supports data resiliency at data center scale using cloud-native tools and methodologies. More information on the KumoScale solution is available in the KumoScale Cross Domain Resiliency Solution Brief.
This document describes how to implement the KumoScale solution in bare-metal deployments using Ansible™ modules and playbooks.
Before you begin any implementation, ensure that your environment meets the installation requirements detailed in the following:
- KumoScale Installation Guide, to install KumoScale software and configure the storage cluster. Note that in a bare-metal environment, the same Ansible server can host both the KumoScale storage cluster and the Ansible modules and playbooks.
- Ansible Installation Prerequisites, to install the Ansible module package.
Before using the Ansible module package, users should be familiar with KumoScale software terminology and processes. This information can be found in the following KumoScale software documents at https://kumoscale.kioxia.com/en/documents:
- KumoScale System Overview provides an architectural overview of KumoScale software for all environments including bare-metal.
- KumoScale User Manual describes KumoScale software features and the procedures for implementing a scale-out storage system built on the NVMe® protocol.
- KumoScale Release Notes lists the changes made to the latest version of KumoScale software.
- KumoScale Cross Domain Resiliency Solution Brief describes the KumoScale CDDR solution for all environments including bare-metal.
This document is written for storage administrators. It is assumed that the reader has a working knowledge of storage and networking. Topics covered in this guide include:
- KumoScale Bare-Metal System Overview: describes the CDDR solution architecture using KumoScale software and Ansible modules.
- Ansible Modules Installation and Configuration: provides step-by-step procedures for installing the KumoScale software Ansible module package.
- Ansible Module Package: lists and provides usage for all package components, modules, playbooks, and functions.
- Volume Management Example Playbooks Sequence: explains how to use Ansible playbooks to manage volumes and provision storage.
- Ansible Failure Recovery and Monitoring: explains how KumoScale software and Ansible modules address different failure scenarios and support monitoring to ensure cross-domain resiliency.
In a bare-metal deployment infrastructure, cross-domain resiliency is deployed using Ansible modules and playbooks. The KumoScale software Ansible module package provides the modules, example playbooks, variable file, and initiator files for a KumoScale software implementation that can be integrated with data center DevOps procedures.
The KumoScale resiliency solution uses a standard Linux™ RAID module (i.e., RAID 0 or RAID 1) located on the host servers. The Linux RAID service is configured and monitored as part of the resiliency solution. Because volume resiliency is implemented at the data center level across multiple failure domains, KumoScale software does not implement resiliency at the drive level (i.e., using RAID 5 or RAID 6).
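As a rough illustration of what a host-side mirror over several volume replicas looks like, the sketch below builds an `mdadm` command line for a RAID 1 array. It is only an assumption about how such an array could be assembled by hand; in practice the KumoScale Ansible module package performs the configuration, and the helper name and device paths shown here are hypothetical.

```python
def mdadm_create_command(md_device, replica_devices):
    """Build an mdadm argv that mirrors (RAID 1) the given block devices.

    The command is only constructed, not executed.
    """
    if len(replica_devices) < 2:
        raise ValueError("a mirror needs at least two replica devices")
    return [
        "mdadm", "--create", md_device,
        "--level=1",
        f"--raid-devices={len(replica_devices)}",
        *replica_devices,
    ]


# Hypothetical device names for two replicas attached to the host.
command = mdadm_create_command("/dev/md0", ["/dev/nvme1n1", "/dev/nvme2n1"])
print(" ".join(command))
```

Building the argument list separately from running it keeps the sketch safe to execute on any machine and makes the intended array layout easy to review.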
Provisioning and monitoring the resiliency configuration are external to KumoScale software:
- Provisioning typically runs as a service on a server designated by the data center DevOps team.
- Monitoring of the resiliency solution is performed by software running on the application’s host, also known as the initiator. It is based on a telemetry feed that is sent to a time-series database (TSDB). A monitoring and alerting application is built on top of the TSDB.
KumoScale Cross-Domain Topology
An example of a KumoScale cross-domain resiliency configuration is shown in Figure 1. In this configuration:
- Triple volume replicas are provisioned in separate failure domains.
- ToR switches connect KumoScale storage nodes to application servers and connect different racks via a spine router.
- An Ansible provisioning script is executed to provision this solution.
- The KumoScale Provisioner Service interfaces with Ansible playbooks and provisions storage across the appropriate KumoScale storage nodes.
- The volume replicas are maintained in sync and are re-synced upon recovery from failure in one of the domains.
- A monitoring and alerting application based on a TSDB maintains the state of the solution and generates alerts as required.
Figure 1. Example Deployment of a Cross-Domain Resilient Volume
KumoScale Provisioner Service
The KumoScale Provisioner service, shown in Figure 2, interfaces between the data center provisioning framework and the KumoScale software deployment. The main functions of the KumoScale Provisioner service are to:
- Maintain the KumoScale software deployed inventory information.
- Maintain information regarding volume mapping and other statistics across all KumoScale storage nodes.
- Receive storage provisioning requests from the provisioning framework, specifying the required capacity and the storage class.
- Determine the optimal provisioning scenario across the KumoScale software inventory.
- Provision the requested storage resources.
Figure 2. KumoScale Provisioner Service
Ansible Module Package discusses how the playbooks and modules are used for provisioning storage.
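The placement decision listed among the Provisioner functions can be sketched as follows. This is not the Provisioner's actual algorithm, which this document does not describe; it is a simple assumed policy (one node per failure domain, most free capacity first), and `StorageNode` and `place_replicas` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical inventory record: node name, failure domain, free capacity in GiB.
StorageNode = namedtuple("StorageNode", ["name", "failure_domain", "free_gib"])


def place_replicas(nodes, capacity_gib, replica_count):
    """Pick one node per failure domain, preferring nodes with the most free space."""
    best_per_domain = {}
    for node in nodes:
        if node.free_gib < capacity_gib:
            continue  # node cannot hold a replica of the requested size
        current = best_per_domain.get(node.failure_domain)
        if current is None or node.free_gib > current.free_gib:
            best_per_domain[node.failure_domain] = node
    candidates = sorted(
        best_per_domain.values(), key=lambda n: n.free_gib, reverse=True
    )
    if len(candidates) < replica_count:
        raise ValueError("not enough failure domains with sufficient capacity")
    return candidates[:replica_count]
```

Given an inventory spanning three racks, a request for two replicas returns two nodes that never share a failure domain, and a request that cannot be satisfied raises an error instead of silently placing replicas together.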
Failure Recovery Scenarios
The KumoScale resiliency solution is designed to recover from several failure scenarios:
- Short-term disconnect between application server and storage volume.
- Long-term disconnect, such as application server boot.
- Long-term and permanent disconnect due to a network issue, such as ToR maintenance, a KumoScale software upgrade, or a rack power-down.
Data resiliency is maintained via a Linux RAID module in the application server. This layer is managed and monitored as part of the resiliency solution. It can be configured to maintain two or three volume replicas, and to re-sync the replicas upon recovery from a failure.
These failure scenarios are discussed further in Ansible Failure Recovery and Monitoring.
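Because the replicas are held together by a Linux RAID array on the application server, a lost replica shows up as a degraded array in `/proc/mdstat`. The sketch below parses that text to find degraded arrays. It is an illustration only, not the monitoring code shipped with the Ansible module package; the function name and the sample content are assumptions.

```python
import re

ARRAY_RE = re.compile(r"^(md\w+)\s*:")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

# Illustrative /proc/mdstat content: md0 is healthy, md1 has lost one replica.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 nvme2n1[1] nvme1n1[0]
      1048512 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme4n1[1](F) nvme3n1[0]
      1048512 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected_members, active_members)} for degraded arrays."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (expected, active)
            current = None
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live host, the same function could be fed the contents of `/proc/mdstat` in place of the sample string.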
Monitoring and Alerts
The cross-domain resiliency solution is continuously monitored, and alerts are generated upon certain events. Telemetry data and certain events, such as degradation in the state of a resilient volume, are sent to and stored in a Syslog server.
This topic is discussed further in Ansible Failure Recovery and Monitoring.
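A minimal sketch of reporting a volume state change to a Syslog server is shown below, using Python's standard `logging` module. The logger name, message format, and function names are assumptions chosen for illustration and do not reflect the format used by the KumoScale monitoring components.

```python
import logging
import logging.handlers


def make_syslog_logger(host, port=514):
    """Create a logger that forwards records to a Syslog server over UDP."""
    logger = logging.getLogger("resiliency-monitor")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address=(host, port)))
    return logger


def report_volume_state(logger, volume, expected, active):
    """Log a warning when a resilient volume has fewer active replicas than expected."""
    if active < expected:
        logger.warning(
            "volume=%s state=degraded replicas=%d/%d", volume, active, expected
        )
        return "degraded"
    logger.info("volume=%s state=healthy replicas=%d/%d", volume, active, expected)
    return "healthy"
```

A caller would create the logger once with the address of its own Syslog server, then call `report_volume_state` each time the replica count of a volume is sampled.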