KumoScale

Maximizing SSD Performance

As one of the world's largest SSD manufacturers, KIOXIA has a great deal of expertise in NVMe and NVMe-oF technology — expertise KumoScale™ technology leverages in service of extracting all of the available performance, no matter which manufacturer or model of drive is used.

What's Inside

Each NAND flash die in recent-generation SSDs can hold 32 or 64 gigabytes of raw data. When using 32 GB die, we need about 32 of them per terabyte (TB). Flash die are typically divided into two or four planes: separate storage regions that operate somewhat independently and can be treated, for many purposes, as if they were separate ICs. Each plane can process a single command (read, write, or erase) at a time. A hypothetical 1 TB drive comprising 32 two-plane die can therefore process at most 64 read, write, or erase commands simultaneously.
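As a back-of-the-envelope illustration of that arithmetic, here is a minimal sketch in Python (the capacities are the hypothetical ones from this example, not a specific KIOXIA part):

    DIE_CAPACITY_GB   = 32    # raw capacity per NAND die
    PLANES_PER_DIE    = 2     # each plane processes one command at a time
    DRIVE_CAPACITY_GB = 1024  # hypothetical 1 TB drive

    dies   = DRIVE_CAPACITY_GB // DIE_CAPACITY_GB   # 32 die
    planes = dies * PLANES_PER_DIE                  # 64 planes
    print(f"{dies} die x {PLANES_PER_DIE} planes = {planes} commands in flight, at most")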

The Read-Only Case

Without predictive knowledge passed from an application, or in a multiple-client environment, the addresses of reads typically exhibit a uniform random distribution spanning the entire drive capacity. This means that the chance of any given read being serviced by any particular one of the 64 planes is the same: 1/64. If all commands are reads, then the time it takes a plane to retrieve the data once the command and address are presented (i.e., read latency) is nearly constant.

A Special Grocery Store

To make these concepts concrete, consider an everyday example of the sort of parallel system of queues represented by SSD architecture: a grocery store checkout line.  Imagine this grocery store has 64 checkout lanes. Shoppers are required to use a specific lane, assigned according to their name.  If every shopper purchased exactly one item, the latency (time in the store) and throughput (shoppers per unit time) would exhibit statistics similar to those of an SSD doing uniform random reads of identical size.

In characterizing storage performance, queue depth (QD) is frequently used as the independent parameter. The queue depth for the entire store (SSD) refers to the number of shoppers (read commands) in the system. Assuming shoppers' names (read addresses) are uniformly distributed, the average QD of the individual checkout lanes (flash planes) will be 1/64 of that value.
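A toy simulation makes the lane-level picture concrete (purely illustrative; the lane count matches the 64-plane example and the store QD is arbitrary):

    import random

    LANES    = 64    # flash planes / checkout lanes
    STORE_QD = 64    # shoppers (read commands) currently in the store

    # Assign each shopper to a lane uniformly at random (uniform random
    # read addresses) and inspect the per-lane queue depths.
    lane_qd = [0] * LANES
    for _ in range(STORE_QD):
        lane_qd[random.randrange(LANES)] += 1

    print(f"average per-lane QD: {sum(lane_qd) / LANES:.2f}")  # STORE_QD / 64
    print(f"busiest lane QD:     {max(lane_qd)}")              # usually above the average

Even at a modest total QD, a few lanes end up with several shoppers while others sit empty, which is where queuing delay begins.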

 

Grocery analogy - Low QD Read

Uniform random reads, low queue depth

As the number of shoppers in the store grows from 1 to 2 to 3 and so on, we expect to see little variation in their average time in the store (read latency), because most shoppers encounter an empty checkout line and do not need to wait. This condition will persist so long as the average arrival rate of new shoppers (read I/Os per second) is low compared to the maximum sustained aggregate rate of all the checkout operators (maximum read IOPS).

Longer Queues

As the arrival rate of new shoppers (read commands) approaches the maximum the system can handle, the lengths of the individual queues grow.

Grocery analogy - High QD Reads

The mean time spent by shoppers in the store (read latency) can be modeled mathematically using queuing theory:

Equation - tREAD vs load

Load% is the fraction of the maximum sustained checkout rate (read IOPS) required to serve the traffic.  Under some relatively benign assumptions, the delay percentiles can be seen in the graph below.  This graph plots SSD read access time in units of the "Queue Depth 1" access time, i.e. the time required for an idle system to process a single read command.  The X-axis measures the workload intensity, or the arrival rate of read commands, as a percentage of the maximum read IOPS rate the SSD can deliver.

The dotted line shows the mean latency.  At 50% load, the average read takes twice as long as a single, isolated read due to the probability that it will wait behind another read command.  At 90% load, the average read latency is ten times the QD1 value.
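These mean values are consistent with the classic single-queue (M/M/1) relation, which serves as a reasonable first-order approximation here:

    tREAD ≈ tQD1 / (1 - Load)

giving 2x the QD1 latency at 50% load and 10x at 90% load.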

Since the queuing delay time is a random variable, it's useful to consider descriptors other than the mean value, e.g. percentiles. A value representing the 90th percentile read delay means that 90% of all reads will experience delays less than or equal to that value. As shown in the chart, the 90th percentile latency at 50% load is 4.6 times the base value, while the 90th percentile at 90% load climbs to 23 times the delay for QD1! Again, this result stems from queuing theory and is not related to the design of the SSD itself.
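These percentile multipliers, along with the means above, can be reproduced with a minimal sketch of the same single-queue model (an approximation in which response times are exponentially distributed, so the p-th percentile is the mean scaled by ln(1/(1 − p))):

    from math import log

    def latency_factor(load, percentile=None):
        # Latency as a multiple of the idle (QD1) latency for an M/M/1
        # queue at the given load (0 <= load < 1).  With no percentile,
        # return the mean; otherwise return that percentile (response
        # times are exponentially distributed in this model).
        mean = 1.0 / (1.0 - load)
        if percentile is None:
            return mean
        return mean * log(1.0 / (1.0 - percentile))

    for load in (0.50, 0.80, 0.90):
        print(f"load {load:.0%}: mean {latency_factor(load):4.1f}x, "
              f"90th pct {latency_factor(load, 0.90):4.1f}x")

    # load 50%: mean  2.0x, 90th pct  4.6x
    # load 80%: mean  5.0x, 90th pct 11.5x
    # load 90%: mean 10.0x, 90th pct 23.0x

The 80% row also hints at the rule of thumb discussed below: beyond roughly that load, both the mean and the tail latency grow very quickly.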

 

Graph - tRead vs. Workload Intensity by percentile

Read Latency vs. Workload Intensity 

When characterizing an SSD, this queuing behavior presents as a break point in the latency vs. QD relation. At very low QD, most checkout lanes are empty, and so delay time is fairly constant. But once each lane has a few shoppers, additional load sharply increases the average queue length and thus the queuing delay. As load nears 100%, delay rises toward infinity.

So, for good latency performance, the IOPS load should be kept below 80% or so; but to maximize IOPS/$, SSDs should be kept reasonably loaded. This tradeoff is an important factor in maximizing the value of flash storage.

Introducing Writes

Now consider what happens when write commands are interspersed with reads. Due to the nature of NAND flash, writes, or program operations, take from 10 to 30 times longer than reads. This can be modeled as shoppers with very full shopping carts. Write amplification, the extra writes necessitated by internal rearrangement of stored data, is ignored here; more on write amplification below.

Grocery analogy - High QD Mixed

Mixed read/write workload, high queue depth

As can be seen intuitively from the above image, write operations can significantly delay reads which are queued behind them. Most SSD firmware tries to prioritize reads, but nonetheless, the effect of writes is pronounced: overall throughput in shoppers per unit time (IOPS) decreases sharply as the percentage of writes increases. Also, mean latencies become longer, and at intermediate QD the latency distribution can be bimodal, depending on the fraction of read commands that wait behind at least one write vs. only other reads.
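A heavily simplified sketch of the throughput effect, assuming a write costs about 20x a read (somewhere in the 10x to 30x range above) and ignoring read prioritization and write amplification:

    T_READ  = 1.0            # read service time (arbitrary units)
    T_WRITE = 20.0 * T_READ  # assumed write (program) service time
    PLANES  = 64

    for write_frac in (0.0, 0.1, 0.3, 0.5):
        mean_service = (1 - write_frac) * T_READ + write_frac * T_WRITE
        iops = PLANES / mean_service      # aggregate commands per unit time
        print(f"{write_frac:.0%} writes -> {iops:5.1f} IOPS "
              f"({iops * T_READ / PLANES:.0%} of the read-only rate)")

    # 0% writes ->  64.0 IOPS (100% of the read-only rate)
    # 10% writes ->  22.1 IOPS (34% of the read-only rate)
    # 30% writes ->   9.6 IOPS (15% of the read-only rate)
    # 50% writes ->   6.1 IOPS (10% of the read-only rate)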

Write Amplification

Write amplification is an artifact of two constraints imposed by semiconductor physics and the design of NAND flash memory. The first is that the write operation in flash is uni-polar, or one-sided. If a bit containing no data is taken to represent binary 0, it is possible to write a binary 1 to that location in-situ. But the reverse is not possible, as there is no capability to write an individual location to 0.  Instead, a region of memory must be erased, which sets all bits to 0. The challenge arises because erasures are only possible on relatively large regions of data, much larger than the typical 4kB data element associated with a Logical Block Address (LBA).
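To get a feel for the scale mismatch, here is a small illustration (the page size and pages-per-block figures are assumptions; actual geometries vary by NAND generation):

    LBA_SIZE_KB   = 4        # typical logical block size
    PAGE_SIZE_KB  = 16       # assumed flash page size (illustrative)
    PAGES_PER_BLK = 256      # assumed pages per erase block (illustrative)

    erase_block_kb = PAGE_SIZE_KB * PAGES_PER_BLK      # 4096 KB = 4 MB
    lbas_per_block = erase_block_kb // LBA_SIZE_KB     # 1024 LBAs
    print(f"one erase block spans {lbas_per_block} LBAs "
          f"({erase_block_kb // 1024} MB vs a {LBA_SIZE_KB} KB write)")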

SSD firmware must continually rearrange data internally on the drive to de-fragment the internal address space and prepare freshly erased regions to accept new data. The write operations associated with this activity compete with external I/O operations for the drive's performance budget. This housekeeping only becomes necessary when data previously written to the drive is overwritten with a new value, and it can make external write traffic appear amplified in terms of its effect on the drive's performance budget (and endurance). The ratio of total bytes written to the flash media to the number of bytes written by the host is called the Write Amplification Factor, or WAF.

WAF depends critically on two factors. The first of these is the sequence of write addresses. It is commonly thought that sequential addressing in itself results in low WAF, but this is not quite right. Since all flash writes are remapped internally by the drive, logical block addresses simply serve as tags by which the data can be identified, and their numerical values carry no meaning. What matters is the degree to which the sequence of LBAs used to write the data the first time is repeated when that data is overwritten. Sequential workloads exhibit low WAF not because of the order of the LBA values, but because that order repeats over and over.

The second key factor is called over-provisioning, or OP. This is the amount of unoccupied data space available within the drive. All drives ship with a total storage capacity larger than the logical capacity exposed to the user; the difference is the designed-in OP. However, any data locations that are de-allocated (a.k.a. TRIMmed) or never written are also spare space, and they lower WAF just as effectively as the designed-in equivalent. Effective, or instantaneous, OP is the total amount of unoccupied capacity expressed as a ratio to the logical capacity of the drive. In some cases, effective OP can exceed 100%.
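A hedged sketch of this bookkeeping (the capacities are invented for illustration):

    physical_gb = 1200   # raw flash inside the drive
    logical_gb  = 960    # capacity exposed to the user
    valid_gb    = 480    # data currently written and not TRIMmed

    designed_op  = (physical_gb - logical_gb) / logical_gb   # 0.25 -> 25%
    # Anything not holding valid user data acts as spare space.
    effective_op = (physical_gb - valid_gb) / logical_gb     # 0.75 -> 75%

    print(f"designed-in OP: {designed_op:.0%}")
    print(f"effective OP:   {effective_op:.0%}")

With valid_gb set to zero (an empty or fully TRIMmed drive), the same arithmetic gives 125%, which is how effective OP can exceed 100%.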

The chart below shows the relationship between OP and write amplification for a worst-case address sequence.
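The exact shape of that curve depends on the drive's garbage-collection policy, but a simple worst-case estimate follows from the constraints above: if the drive is full of valid data and the invalidated pages are spread evenly across every erase block, each reclaimed block is only OP/(1 + OP) free, giving roughly WAF ≈ (1 + OP)/OP. This is an illustrative bound, not necessarily the model behind the chart:

    def worst_case_waf(effective_op):
        # Each reclaimed block must first relocate a valid fraction of
        # 1/(1 + OP), so only OP/(1 + OP) of it is freed for host data.
        return (1 + effective_op) / effective_op

    for op in (0.07, 0.25, 0.50, 1.00):
        print(f"effective OP {op:.0%}: WAF ~ {worst_case_waf(op):.1f}")

    # effective OP 7%: WAF ~ 15.3
    # effective OP 25%: WAF ~ 5.0
    # effective OP 50%: WAF ~ 3.0
    # effective OP 100%: WAF ~ 2.0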

Graph - WAF vs. Effective OP

Write Amplification vs. Effective Over-provisioning for uniformly distributed random writes

Write Latency

SSD designers attempt to hide the much longer delay time for writes by means of a write buffer. Generally implemented in power-loss-protected RAM, this buffer captures write traffic quickly, so that completion of the command can be returned to the host without delay. Then the data is transferred from RAM to flash. While the latency of individual writes is long, the internal aggregate write bandwidth is generally high enough to avoid buffer overflow even at very heavy write workloads. Thus, actual write latencies specified on any SSD data sheet are generally lower than read latencies, even though the situation at the flash media is the opposite.

Efficiency vs. Load

In modern storage systems, the denominator of the cost-effectiveness fraction is more frequently IOPS than storage capacity. I/O rate is a quantity that can be purchased and allocated just like capacity. These desiderata might lead us to challenge each SSD with the maximum amount of data it can hold, and the maximum I/O load it can sustain, in order to extract maximum efficiency. As the latency graph above shows, the penalty associated with this strategy is potentially severe.

Workload Blending

Performance needs of users can differ widely. KumoScale uses the notion of Advanced Storage Classes as a mechanism for application owners to describe the performance they need, and for infrastructure owners to apportion that performance among applications. A distinct Storage Class should be associated with a set of client application instances that share: 1) common I/O behavior, and 2) common I/O performance requirements.

When estimating the impact of allocating a new volume to a particular SSD, several effects must be accounted for. First, the incremental work from the new client adds to the queue depth seen by all users sharing the drive, reducing performance. Second, as the SSD now contains more data than before, the effective over-provisioning factor of the drive is reduced, increasing write amplification for all resident volumes. This puts even more stress on latency by increasing the number of "full shopping carts" waiting in line for each flash device. For relatively full drives, this factor can far outweigh the prior one.
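A minimal sketch of how these two effects can be combined into a single estimate, reusing the illustrative M/M/1 latency and worst-case WAF models from earlier (all parameters, including the 20x write cost and the drive's maximum read IOPS, are invented for illustration; this is not KumoScale's actual placement algorithm):

    def mean_latency_factor(load):
        # M/M/1-style mean latency as a multiple of the idle (QD1) latency.
        return 1.0 / (1.0 - load) if load < 1.0 else float("inf")

    def effective_op(physical_gb, logical_gb, valid_gb):
        return (physical_gb - valid_gb) / logical_gb

    def worst_case_waf(op):
        return (1 + op) / op

    def load_fraction(host_iops, write_frac, waf, max_read_iops, write_cost=20.0):
        # Fraction of the drive's service capacity consumed, counting the
        # internal garbage-collection writes implied by the WAF and the
        # higher per-command cost of writes relative to reads.
        reads  = host_iops * (1 - write_frac)
        writes = host_iops * write_frac * waf
        return (reads + writes * write_cost) / max_read_iops

    PHYSICAL_GB, LOGICAL_GB, MAX_READ_IOPS = 1200, 960, 700_000

    # Before and after placing a hypothetical 200 GB, 5,000 IOPS volume:
    for valid_gb, host_iops in ((500, 20_000), (700, 25_000)):
        op   = effective_op(PHYSICAL_GB, LOGICAL_GB, valid_gb)
        waf  = worst_case_waf(op)
        load = load_fraction(host_iops, 0.3, waf, MAX_READ_IOPS)
        print(f"{valid_gb} GB valid: OP {op:.0%}, WAF {waf:.1f}, "
              f"load {load:.0%}, mean latency {mean_latency_factor(load):.1f}x")

    # 500 GB valid: OP 73%, WAF 2.4, load 43%, mean latency 1.7x
    # 700 GB valid: OP 52%, WAF 2.9, load 65%, mean latency 2.9x

In this sketch, the added data shrinks effective OP, which raises WAF, which in turn raises the load and therefore the expected latency for every volume on the drive.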

By juxtaposing these models and using them to predict the impact of new or changed drive assignments, KumoScale software can work to ensure that every volume receives its performance quota and that physical assets are used efficiently but not over-burdened.