Advanced data virtualization (Part 1)
After attending the recent SNW in Phoenix, and having some time to synthesize and digest, I’m convinced that the storage virtualization products on the market today are a good first step, but are going to have to evolve.
Over the next few weeks, I’ll describe several trends that we expect will be needed to realize true storage virtualization. This week’s post focuses on the trend of advanced data virtualization.
Storage virtualization creates efficiencies by inserting a layer of abstraction between data and storage hardware, and that same concept can be taken further to present a layer of abstraction between data and the method by which data is stored.
RAID is a well-known form of data virtualization: the linear sequence of bytes is transformed so that the data is striped across the array along with the necessary parity bits. RAID’s data virtualization technique was designed over 20 years ago to improve data reliability and I/O performance, and it is starting to fail as we transition from structured data to large quantities of unstructured data.
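To make the striping-plus-parity idea concrete, here is a minimal sketch of single-parity striping in the style of RAID 4/5. It is an illustration only, not how any particular array is implemented; the function names (`stripe_with_parity`, `rebuild_block`) and the zero-padding scheme are assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data: bytes, data_disks: int):
    """Split data into `data_disks` equal blocks and append one XOR parity block.

    Hypothetical helper: the data is zero-padded so it divides evenly.
    """
    block_len = -(-len(data) // data_disks)  # ceiling division
    padded = data.ljust(block_len * data_disks, b"\x00")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(data_disks)]
    return blocks + [xor_blocks(blocks)]


def rebuild_block(blocks, lost_index):
    """Recompute the block at `lost_index` by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return xor_blocks(survivors)


# Three data "disks" plus one parity "disk".
stripes = stripe_with_parity(b"hello world!", 3)
recovered = rebuild_block(stripes, 1)  # pretend disk 1 failed
```

With a single parity block, any one lost block can be rebuilt from the others, but a second simultaneous loss cannot be recovered.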
Dispersal, as we’ve discussed in numerous posts on this blog, is a natural successor to RAID for data virtualization because it can be configured with M of N fault tolerance, which can provide much higher levels of data reliability than RAID. Dispersal essentially packetizes the data into N packets, and requires only a subset of them (any M packets) to recreate the data bit-perfectly.
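The M of N property can be sketched with a small polynomial-based erasure code. This is not the dispersal algorithm used in any product; it is a generic Reed-Solomon-style illustration under stated assumptions: arithmetic is done modulo the prime 257, each group of M bytes is treated as the coefficients of a polynomial, and slice number x stores that polynomial's value at x. The names `disperse` and `reconstruct` are hypothetical.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def disperse(data: bytes, m: int, n: int):
    """Encode data into n slices (lists of ints mod P); any m can rebuild it."""
    assert 1 <= m <= n < P
    padded = data + bytes(-len(data) % m)  # zero-pad to a multiple of m
    slices = [[] for _ in range(n)]
    for start in range(0, len(padded), m):
        coeffs = padded[start:start + m]
        for x in range(1, n + 1):
            y = 0
            for c in reversed(coeffs):  # Horner evaluation at point x
                y = (y * x + c) % P
            slices[x - 1].append(y)
    return slices


def _lagrange_basis(xs):
    """Coefficients (lowest degree first) of each Lagrange basis polynomial."""
    basis = []
    for i, xi in enumerate(xs):
        poly, denom = [1], 1
        for j, xj in enumerate(xs):
            if j == i:
                continue
            # Multiply poly by (x - xj).
            poly = [(a - xj * b) % P for a, b in zip([0] + poly, poly + [0])]
            denom = denom * (xi - xj) % P
        inv = pow(denom, P - 2, P)  # modular inverse via Fermat's little theorem
        basis.append([c * inv % P for c in poly])
    return basis


def reconstruct(available: dict, m: int, length: int) -> bytes:
    """Rebuild the data from a dict of {slice_index: slice} with >= m entries."""
    if len(available) < m:
        raise ValueError("not enough slices to reconstruct the data")
    chosen = sorted(available)[:m]
    basis = _lagrange_basis([i + 1 for i in chosen])
    out = bytearray()
    for pos in range(len(available[chosen[0]])):
        ys = [available[i][pos] for i in chosen]
        for k in range(m):  # coefficient k is one original byte
            out.append(sum(y * b[k] for y, b in zip(ys, basis)) % P)
    return bytes(out[:length])


data = b"dispersal demo"
slices = disperse(data, m=3, n=5)
# Any 3 of the 5 slices are enough; here slices 1 and 3 are "lost".
restored = reconstruct({0: slices[0], 2: slices[2], 4: slices[4]}, 3, len(data))
```

In this 3 of 5 configuration, two slices can be lost and the data still comes back exactly, whereas a single-parity RAID set tolerates only one loss.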
One major change for data virtualization that will occur as Dispersal replaces RAID is that there will no longer be a tight coupling between hardware and the storage of the data packets. This will eliminate the concept of having copies of data on hardware.
Today’s RAID systems stripe data and parity bits across the disks of an array inside an appliance. When asked “Where is my data?” the answer is typically “On this piece of hardware.” This gives people peace of mind: something intangible (since the data is actually virtualized) feels tangible because it is contained within a physical device.
People will shift from asking “Where is my data?” to asking “Is my data protected?” The data will be virtualized across multiple devices in multiple locations, and the second question is what the first was really about all along. Once people are comfortable giving up knowing exactly where their data resides, they will realize the benefits of data virtualization. A future post in this series will discuss how management systems will need to evolve to address data protection concerns.
The largest benefit of storing data packets across multiple hardware nodes is increased fault tolerance. RAID is structured to provide disk drive fault tolerance: if a disk fails, the other disks can reconstruct the data. Dispersal provides not only disk-level fault tolerance, but also device-level fault tolerance, and even location fault tolerance. When an entire device fails, the data can be reconstructed from virtualized data packets on other devices, whether centrally located or across multiple sites.
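The difference between disk, device, and location fault tolerance comes down to counting: the data survives as long as at least M of the N packets are still reachable. Here is a rough sketch under assumed numbers (an 8 of 12 configuration spread over three sites with four devices each); the site and device names, the round-robin placement, and the function names are all hypothetical.

```python
from itertools import cycle


def place_slices(n, devices):
    """Assign slice ids 0..n-1 round-robin to (site, device) pairs."""
    return {slice_id: dev for slice_id, dev in zip(range(n), cycle(devices))}


def survives(placement, m, failed_sites=(), failed_devices=()):
    """Return True if at least m slices remain on healthy devices and sites."""
    alive = [
        slice_id
        for slice_id, (site, device) in placement.items()
        if site not in failed_sites and (site, device) not in failed_devices
    ]
    return len(alive) >= m


# Hypothetical layout: 3 sites x 4 devices, one slice per device.
devices = [(f"site-{s}", f"node-{d}") for s in "abc" for d in range(4)]
placement = place_slices(12, devices)

one_device_down = survives(placement, 8, failed_devices=[("site-a", "node-0")])
one_site_down = survives(placement, 8, failed_sites=["site-b"])
two_sites_down = survives(placement, 8, failed_sites=["site-a", "site-b"])
```

Under these assumptions, losing one device leaves 11 packets and losing a whole site leaves 8, so the data is still recoverable in both cases; losing two sites leaves only 4 packets, which is below the threshold of 8.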
We’ll discuss location fault tolerance in more detail in the next post.