Responsibility for infrastructure failures can be a heavy burden. For those on the front lines, a complex recovery process executed under time pressure can be nerve-wracking. KumoScale™ software makes recovery from SSD or storage node failure fully automatic. Policies determine when an unreachable resource is declared permanently failed, how it is replaced and its data re-synced, and how any orphan volumes left behind are cleaned up. Full details and progress are logged, and policies can be easily changed in anticipation of maintenance events.
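To make the idea of a recovery policy concrete, here is a minimal sketch of one such rule: a resource that stays unreachable longer than a grace period is treated as permanently failed, unless automatic recovery has been switched off ahead of maintenance. The names `RecoveryPolicy`, `auto_recover`, `grace_period_s`, and `should_declare_permanent` are hypothetical and chosen for illustration only; they are not taken from the KumoScale API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPolicy:
    """Hypothetical policy record; field names are illustrative assumptions."""
    auto_recover: bool = True       # set to False ahead of planned maintenance
    grace_period_s: float = 600.0   # how long a resource may stay unreachable


def should_declare_permanent(policy: RecoveryPolicy,
                             unreachable_since_s: float,
                             now_s: float) -> bool:
    """Return True when an unreachable resource should be treated as
    permanently failed under the given policy."""
    if not policy.auto_recover:
        # Automatic recovery is disabled, e.g. during a maintenance window.
        return False
    # Declare the failure permanent once the grace period has elapsed.
    return (now_s - unreachable_since_s) >= policy.grace_period_s
```

In this sketch, an operator preparing for maintenance would swap in a policy with `auto_recover=False` so that a node taken offline deliberately does not trigger replacement and re-sync.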
Data center automation is a fast-evolving field, and the degree of abstraction has increased markedly over the last few years. Simple scripting in languages like Bash and Perl gave way to frameworks like Ansible®, Puppet®, and Chef®. These methods were in turn matched in popularity by the declarative approach used in tools such as Terraform®. Finally, Kubernetes has emerged as the preeminent orchestration framework for containerized application instances. KumoScale software exposes a RESTful API that can be easily adapted to any of the above frameworks. Installation and maintenance examples are available upon request.
KumoScale control plane services are containerized, and are deployed and managed by a small, dedicated Kubernetes cluster, typically consisting of three VMs, or compute nodes, located in separate failure zones. Over time, KumoScale control constructs are moving closer to those of Kubernetes, with the goal that the skills needed to develop operators and maintain applications under the Kubernetes standard translate directly to KumoScale software.
High-volume analytics capture is a standard part of data center operations at scale. The difficulty with using time-series analytics data to diagnose problems and monitor operations is not that too little data is available, but too much. The information needed is often commingled with irrelevant information that obscures its meaning.
The central KumoScale notion of storage classes is the key to separating the data needed from the noise. When a StorageClass is assigned to each application use case, telemetry data can be filtered by tenant and by the application that tenant runs. When a customer has performance concerns, this gives SREs visual evidence of exactly what performance that customer has experienced, and when.
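The filtering described above can be sketched as follows, assuming each telemetry sample is tagged with a tenant and a storage class. The `Sample` record, its field names, and the `filter_samples` helper are hypothetical and serve only to illustrate the idea; they do not reflect the actual KumoScale telemetry schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical telemetry sample; field names are illustrative assumptions."""
    timestamp_s: float
    tenant: str
    storage_class: str
    latency_us: float


def filter_samples(samples: Iterable[Sample],
                   tenant: str,
                   storage_class: Optional[str] = None) -> List[Sample]:
    """Keep only the samples for one tenant and, optionally, one storage class."""
    return [
        s for s in samples
        if s.tenant == tenant
        and (storage_class is None or s.storage_class == storage_class)
    ]
```

Narrowing first by tenant and then by storage class leaves a series that covers a single application use case, which is what an SRE would plot when investigating a customer's performance complaint.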