Containers are very different from virtual machines. The more prescriptive way containers store data forces admins and developers to put more thought into which data goes where. It also means traditional data protection makes little sense for backing up container-based applications. Solutions like PortWorx PX-Backup are built to support container environments natively.
I recently sat down with the Pure Storage and PortWorx folks twice (at TheCUBE’s coverage of KubeCon/CloudNativeCon and at Cloud Field Day 9), among them Matt Kixmoeller and Michael Ferranti, to talk about natively protecting data in container environments. This blog post covers how their PX-Backup product solves these challenges.
At a macro level, Kubernetes environments don’t have any more or any less state than more traditional environments. It’s just labeled more stringently, layered in a stack of single-purpose disk images, fragmented across a more diverse range of (cloud) storage services and codified into desired-state and version-controlled pipelines.
So while the way data is stored is different, there’s still the same amount of data to protect, if not more. The differences in how data is stored, though, require a completely new approach to data protection for container-based applications. In this post, I’ll dive into these differences.
First up, container images are very different from virtual machine-based applications, and data protection tools need to handle container images and the fragmentation of data storage natively to be of any use in protecting container-based applications.
Container images are layered in a chained ‘stack’ from operating system to application and everything in-between. Each image is version controlled, making upgrades, patches, configuration changes and component replacements possible without the usual furry hairball of a virtual machine disk image, where there’s no image-based separation of these layers of operating system, middleware and application.
While the single-app-per-image best practice hasn’t changed much between virtual machines and containers, not having to include a base operating system with each container vastly simplifies data management, and the clearer separation between these layers makes all the difference.
We can safely say that container images are much, much better in terms of data separation, making it easier to figure out what data to protect in what way. We only need to store a single copy of each version of the base images, instead of using post-process deduplication to remove duplication across every single virtual machine based on the same image.
Since all the lower-level layers of the container image are read-only, we don’t need to back up those layers for every running container; we only need to back up the source images in the artifact repository. Modern data protection solutions like PX-Backup and Zerto for Kubernetes understand this paradigm.
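To make the point about shared read-only layers concrete, here is a minimal sketch in Python. The image names and shortened layer digests are made up for illustration, and this is not how any particular backup product or registry works internally; it only shows why backing up each unique layer once is cheaper than copying layers per running container.

```python
from typing import Dict, List, Set


def layers_to_back_up(images: Dict[str, List[str]]) -> Set[str]:
    """Return the unique layer digests referenced by all images."""
    unique: Set[str] = set()
    for layer_digests in images.values():
        unique.update(layer_digests)
    return unique


# Hypothetical images and shortened digests, for illustration only.
images = {
    "shop/web:1.4": ["sha256:aaa", "sha256:bbb", "sha256:ccc"],
    "shop/api:2.1": ["sha256:aaa", "sha256:bbb", "sha256:ddd"],
    "shop/worker:2.1": ["sha256:aaa", "sha256:eee"],
}

unique_layers = layers_to_back_up(images)
total_references = sum(len(layers) for layers in images.values())
print(f"{len(unique_layers)} unique layers for {total_references} layer references")
```

Because the three images share their base layers, the number of layers to store is smaller than the number of layer references, without any post-process deduplication.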
Backing up State means backing up Pipelines
That brings us to each running container’s running state, though. If every container image is based on a chain of read-only images, do we even need to back up the running container image?
Ideally, no, as the desired state is expressed in its deployment infrastructure-as-code files (like Terraform files), plus the Dockerfiles that describe the container anatomy and the Kubernetes Pod YAML files describing the application.
All this state absolutely needs to be included in backups, and can be captured at the artifact repository and version control sources. With that in mind, backing up the CI/CD pipelines is just as important as protecting the running state itself.
But even then, it makes sense to capture the state of the running system to protect things like Kubernetes Service descriptions, load balancers (and other networking) state, ConfigMaps, and namespaces to be able to re-create the application on the same or another cluster for data recovery purposes.
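As a rough illustration of what "re-creating the application on another cluster" involves, the sketch below takes a resource as a plain Python dictionary (the shape you would get from parsing its YAML or JSON) and strips fields that the Kubernetes API server fills in for one specific cluster. The function name `portable_copy` and the example ConfigMap are hypothetical, and the field list is not exhaustive; some resource kinds carry further cluster-specific values in their `spec`.

```python
import copy
from typing import Any, Dict

# Metadata fields set by the API server; they are tied to one cluster.
# This list is an assumption for the sketch and is not exhaustive.
SERVER_MANAGED_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "managedFields",
    "generation",
)


def portable_copy(resource: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the resource without cluster-specific fields."""
    cleaned = copy.deepcopy(resource)
    cleaned.pop("status", None)  # observed state, not desired state
    metadata = cleaned.get("metadata", {})
    for field in SERVER_MANAGED_METADATA:
        metadata.pop(field, None)
    return cleaned


# Hypothetical ConfigMap as read from a running cluster.
live_config_map = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {
        "name": "shop-settings",
        "namespace": "shop",
        "uid": "0000-example",
        "resourceVersion": "12345",
    },
    "data": {"LOG_LEVEL": "info"},
}

restorable = portable_copy(live_config_map)
print(restorable["metadata"])
```

Working on a deep copy keeps the captured original intact, which is useful if the backup also wants to retain the raw object for auditing.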
That is because reality is messy: even if you practice GitOps, the operational state may differ from the declared state. Doing Operations is still hard, and under the pressure of an outage people take shortcuts and change production directly instead of going through the pipeline.
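A simple way to picture this drift is to compare the declared state from version control with the state read back from the running system. The following is a minimal sketch, assuming both are available as nested Python dictionaries; the function `find_drift` and the example values are hypothetical, and real tooling would also have to ignore fields the cluster adds on its own.

```python
from typing import Any, Dict, List, Tuple

Drift = Tuple[str, Any, Any]  # (path, declared value, live value)


def find_drift(declared: Dict[str, Any], live: Dict[str, Any], path: str = "") -> List[Drift]:
    """Recursively list the keys whose values differ between the two states."""
    drift: List[Drift] = []
    for key in sorted(set(declared) | set(live)):
        here = f"{path}.{key}" if path else key
        if key not in live:
            drift.append((here, declared[key], None))
        elif key not in declared:
            drift.append((here, None, live[key]))
        elif isinstance(declared[key], dict) and isinstance(live[key], dict):
            drift.extend(find_drift(declared[key], live[key], here))
        elif declared[key] != live[key]:
            drift.append((here, declared[key], live[key]))
    return drift


# Hypothetical example: someone scaled up and set a debug flag during an outage.
declared_state = {"spec": {"replicas": 3, "image": "shop/api:2.1"}}
live_state = {"spec": {"replicas": 5, "image": "shop/api:2.1", "debug": True}}

drift_report = find_drift(declared_state, live_state)
for entry in drift_report:
    print(entry)
```

Anything that shows up in such a report exists only in the running system, which is exactly the state a pipeline-only backup would miss.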
The Entry Point for Data Protection has changed
With Virtual Machines, we could safely assume the VM would be running long enough to successfully create a backup of that particular VM. With containers, it’s a different story. Containers are ephemeral, so there’s no guarantee a container will be running (at all or long enough) to back up data.
The Kubernetes control plane is the logical entry point for backing up data, as the backup software can read configurations to understand the anatomy of an application’s storage claims and perform metadata operations (like backing up all the Kubernetes metadata for a given app). This requires specific integration between the Kubernetes APIs and data protection software.
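To show what "understanding the anatomy of an application’s storage claims" can look like, here is a minimal sketch that reads a Pod manifest (as a Python dictionary) and lists the PersistentVolumeClaims it references. The `spec.volumes[].persistentVolumeClaim.claimName` path is part of the Pod schema; the function name and the example Pod are hypothetical, and real backup software would query the Kubernetes API rather than a hand-written dictionary.

```python
from typing import Any, Dict, List


def claimed_volumes(pod: Dict[str, Any]) -> List[str]:
    """Return the names of the PersistentVolumeClaims a Pod mounts."""
    claims: List[str] = []
    for volume in pod.get("spec", {}).get("volumes", []):
        pvc = volume.get("persistentVolumeClaim")
        if pvc:  # other volume types (emptyDir, configMap, ...) hold no claim
            claims.append(pvc["claimName"])
    return claims


# Hypothetical Pod manifest with one persistent and two non-persistent volumes.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "orders-db-0", "namespace": "shop"},
    "spec": {
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "orders-db-data"}},
            {"name": "cache", "emptyDir": {}},
            {"name": "settings", "configMap": {"name": "shop-settings"}},
        ]
    },
}

pod_claims = claimed_volumes(pod_manifest)
print(pod_claims)
```

Only the claim-backed volume needs a data backup here; the other two volumes are either scratch space or already described by Kubernetes metadata.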
Container images do not contain persistent application data. While this is not vastly different from a well-architected VM-based application, containers enforce data separation but also implicitly fragment data storage across many storage services, some of which may be invisible to backup operators and outside their control.
Often this application data is stored on a managed, well-known object, file or block storage service, but it does happen that application data is stored on an unknown and unmanaged storage service. This is akin to the shadow IT problem IT ops faced, where their users would use unsanctioned, unknown cloud and SaaS services.
While block and file-based storage is easily detected through the Kubernetes metadata, object storage is harder to discover, as the allocation of the object storage does not happen in the Kubernetes metadata, but in the application code itself.
In any case, the immense fragmentation of data across these services poses challenges for consistent data backup using each service’s native data protection tools, requiring the backup software to add support for these storage services through each API to successfully protect entire applications across block, file and object storage.
Fragmentation also happens within Kubernetes itself, as developers are constantly creating and deleting containers, and containers usually have non-descriptive names, making it hard to understand what is actually running on Kubernetes at any given moment.
PX-Backup natively interprets the Kubernetes metadata associated with running containers to understand the relationships between them and figure out which containers make up a given application. By allowing developers to attach data protection policies at the application level rather than the individual container level, it lets admins keep track of data protection.
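The idea of working at the application level rather than the container level can be sketched with Kubernetes labels. The example below groups Pods by the recommended `app.kubernetes.io/name` label; it is only an illustration of the concept under that assumption, not a description of how PX-Backup actually determines application membership, and the Pod names are made up.

```python
from collections import defaultdict
from typing import Any, Dict, List

APP_LABEL = "app.kubernetes.io/name"  # one of the recommended Kubernetes labels


def group_by_application(pods: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group Pod names by the application label they carry."""
    apps: Dict[str, List[str]] = defaultdict(list)
    for pod in pods:
        metadata = pod.get("metadata", {})
        app = metadata.get("labels", {}).get(APP_LABEL, "<unlabeled>")
        apps[app].append(metadata.get("name", "<unnamed>"))
    return dict(apps)


# Hypothetical Pods with the kind of generated names seen on a busy cluster.
pods = [
    {"metadata": {"name": "web-7d9f8-x2k4q", "labels": {APP_LABEL: "shop"}}},
    {"metadata": {"name": "api-5c6b7-p9z1m", "labels": {APP_LABEL: "shop"}}},
    {"metadata": {"name": "etl-6f4d2-h3n8r", "labels": {APP_LABEL: "reporting"}}},
    {"metadata": {"name": "tmp-debug-pod"}},
]

applications = group_by_application(pods)
for app, members in applications.items():
    print(app, members)
```

A policy attached to "shop" would then cover both of its Pods regardless of their generated names, while the unlabeled Pod stands out as something nobody has claimed.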
If you want to learn more about PX-Backup, I recommend watching the same deep-dive I was part of during the recent Cloud Field Day 9 event: