Introduction to Kubernetes* Storage and NVMe-oF* Support
Storage is one of the most important components of a container platform. It gives containers the ability to read and write data during computation in the cloud.
This article introduces container storage architecture, the differences between ephemeral and persistent storage for containers, the volume plugin that Docker* uses to support persistent storage, volume management in Kubernetes, the Container Storage Interface (CSI) that lets Kubernetes volume plugins support storage drivers from many storage providers, and finally an NVMe-oF CSI driver for Kubernetes. Published measurements indicate that NVMe-oF performance on containers is close to native NVMe performance, even though it runs over a network. We completed a prototype implementation of NVMe-oF support in Kubernetes and published a CSI driver on GitHub*.
CONTAINER STORAGE ARCHITECTURE
In the past, containers were usually stateless: after providing its service and being destroyed, a container did not need to save its state for future use. However, as the applications running in containers have developed, containers have evolved from stateless to stateful. Stateful containerized applications require the capability to persist data.
Figure 1. Storage Model of Container
By default, temporary files generated by the container runtime are stored in the writable layer, which is also called the container layer. That is, the data is saved inside the container, and when the lifecycle of the container ends, the data disappears too. This causes the following problems:
- Data persistence: when a container stops, the data it generated is lost.
- High coupling: the writable layer of a container is tightly coupled to the node on which the container runs, so data in the layer cannot easily be migrated, shared, or backed up.
- Poor performance: a container relies on kernel modules to combine file systems, and its writable layer manages storage through storage drivers, which causes a significant drop in I/O performance.
To solve the above problems, Docker* developed a mounting mechanism. Using this mechanism, a file or directory of the node is mapped into the container, and both the container and any process on the node can modify the data in the mounted file or directory. In later releases, Docker added support for volume plugins. If an application in a container needs to save any data, it can write the data into a persistent volume. In this way, containers can be launched, stopped, scaled, or migrated across multiple nodes. If the application has persistent data, the data can be shared across multiple nodes, and if the application needs to write data, a file-based storage interface is the most convenient way for it to do so.
Figure 2. Container Storage Architecture
In Figure 2, the major components of container storage are the control plane and the data plane.
- Storage control plane: The control plane is usually implemented in software. It mainly receives storage requests from the northbound API, such as creating, deleting, mounting, unmounting, and migrating data volumes, and passes the requests to the underlying data plane to complete the actual storage operations. The control plane generally needs to conform to an API specification, either the volume plugin API of the Docker container engine or the volume API of Kubernetes.
- Storage data plane: The data plane provides data persistence. It not only needs to implement storage operations such as read and write, snapshot, data protection and replication on container data volumes, but also needs the ability to share data with multiple nodes. The data plane can be based on file systems such as NFS*, Ceph* FS, etc., or it can be implemented based on block devices such as iSCSI.
EPHEMERAL STORAGE AND PERSISTENT STORAGE
Docker offers a variety of storage options for users. It uses storage drivers to manage the contents of the image layers and the writable container layer, and it defines a set of interfaces for storage drivers. Anything that implements these interfaces can serve as a storage driver. Currently implemented storage drivers include aufs, btrfs, devicemapper, vfs, overlay, and others. Although each storage driver handles the implementation differently, they all use stackable image layers and the copy-on-write (CoW) strategy. Choosing the storage driver that fits the user’s scenario can effectively improve Docker’s performance.
Two of the popular Docker storage drivers are AUFS and OverlayFS. AUFS is a union filesystem. The aufs storage driver was previously the default storage driver used for managing images and layers on Docker for Ubuntu*, and for Debian* versions prior to Stretch. Its main function is to combine directories at different physical locations, mount them onto a single directory, and present that directory to users. AUFS uses disk space efficiently when an application runs a large number of containers that share images. However, its write latency is high, so it performs poorly for write-intensive workloads. Docker recommends using the newer overlay2, which has potential performance advantages over the aufs storage driver.
OverlayFS is a modern union filesystem that is similar to AUFS, but faster and with a simpler implementation. OverlayFS layers two directories called lowerdir and upperdir on a single node and presents them as a single directory. These directories are called layers and the unification process is referred to as a union mount. The unified view is exposed through its own directory called merged. Docker provides two storage drivers for OverlayFS: the original overlay, and the newer and more stable overlay2. Due to its simplicity, OverlayFS has better performance under most circumstances, and overlay2 is better than overlay.
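The lookup order of a union mount can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how the kernel implements OverlayFS: plain dictionaries stand in for the lowerdir and upperdir directories, and the class name UnionMount and its methods are hypothetical. The set of deleted names loosely models the whiteout entries OverlayFS uses to hide lower-layer files.

```python
class UnionMount:
    """Toy model of a union mount: dicts stand in for directories."""

    def __init__(self, lowerdir):
        self.lowerdir = dict(lowerdir)  # read-only image layer
        self.upperdir = {}              # writable container layer
        self.whiteouts = set()          # lower-layer names hidden by a delete

    def read(self, name):
        # The upper layer wins; otherwise fall through to the lower layer.
        if name in self.whiteouts:
            raise FileNotFoundError(name)
        if name in self.upperdir:
            return self.upperdir[name]
        if name in self.lowerdir:
            return self.lowerdir[name]
        raise FileNotFoundError(name)

    def write(self, name, data):
        # Copy-on-write: changes land in the upper layer only,
        # so the lower (image) layer is never modified.
        self.whiteouts.discard(name)
        self.upperdir[name] = data

    def delete(self, name):
        # Remove any upper copy and hide the lower copy, if one exists.
        self.upperdir.pop(name, None)
        if name in self.lowerdir:
            self.whiteouts.add(name)

    def merged(self):
        # The unified view that the container sees.
        view = {k: v for k, v in self.lowerdir.items()
                if k not in self.whiteouts}
        view.update(self.upperdir)
        return view
```

Writing to a file that exists in the image changes only the upper layer, which is why many containers can share one image while each sees its own modifications.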
By default, the container loads the read-only image layer and adds a writable layer on it, providing a unified view to the container through the union file system. When a running container reads and writes data, the I/O data stream passes through the union file system. When the container is destroyed, the data will also be deleted, which has two major drawbacks: poor I/O performance and the same life cycle as the container.
Volumes were introduced to Docker to solve these problems. A volume mounts a directory of a node into the container, and the container reads and writes the directory directly, bypassing the layered file system. As a result, I/O performance is close to the native I/O performance of the physical disk. A volume also lets data persist beyond the life cycle of the container. Volumes have drawbacks of their own, however: a volume cannot read or write data remotely (it can only access local data), it cannot be migrated with the container, and it does not support data sharing.
Docker has supported volume plugins since v1.8 to manage the life cycle of volumes. A volume plugin’s primary job is to map third-party storage into the local file system of the node so that the container can use it as a volume. By following the volume plugin specification, a third-party vendor’s volume provides its services within the Docker engine, and the third-party storage stays independent of the life cycle of the container. This means various storage systems can be mounted into Docker as long as they conform to the specification. Popular storage systems with Docker volume plugins so far include NFS*, CIFS, GlusterFS*, block devices, and more.
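To make the plugin contract more concrete, the sketch below models how a volume plugin might answer the requests Docker sends it. It assumes the JSON-over-HTTP plugin protocol with endpoints such as /VolumeDriver.Create and /VolumeDriver.Mount; the HTTP transport is left out, and the class ToyVolumeDriver, its root directory, and its in-memory bookkeeping are hypothetical. A real plugin would attach the third-party storage where this sketch only computes a path.

```python
class ToyVolumeDriver:
    """In-memory stand-in for a Docker volume plugin (no HTTP, no real mounts)."""

    def __init__(self, root="/mnt/toy-volumes"):
        self.root = root      # hypothetical directory where volumes appear
        self.volumes = {}     # volume name -> driver options

    def handle(self, endpoint, request):
        # Each endpoint takes a JSON object and returns a JSON object;
        # an empty "Err" string signals success.
        if endpoint == "/Plugin.Activate":
            return {"Implements": ["VolumeDriver"]}
        name = request.get("Name", "")
        if endpoint == "/VolumeDriver.Create":
            self.volumes[name] = request.get("Opts") or {}
            return {"Err": ""}
        if name not in self.volumes:
            return {"Err": f"volume {name!r} not found"}
        if endpoint == "/VolumeDriver.Mount":
            # A real plugin would map remote storage to this local path.
            return {"Mountpoint": f"{self.root}/{name}", "Err": ""}
        if endpoint == "/VolumeDriver.Unmount":
            return {"Err": ""}
        if endpoint == "/VolumeDriver.Remove":
            del self.volumes[name]
            return {"Err": ""}
        return {"Err": f"unsupported endpoint {endpoint}"}
```

The point of the design is that Docker only ever sees a local mount point; everything behind that path is the plugin's responsibility.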
VOLUME MANAGEMENT IN KUBERNETES
Kubernetes is a well-known, portable, extensible, open source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available. This section focuses on volume management in Kubernetes.
As mentioned earlier, Docker v1.8 introduced a volume plugin specification that allows third-party vendors to provide data services in the Docker runtime engine by implementing it. Kubernetes, however, does not use the Docker volume plugin because it does not want to be bound to a specific container technology. Instead, Kubernetes defines its own generic volume plugin specification to support more container runtimes, including Docker and Rocket*.
Whereas a Docker volume belongs to a container, the data on a Kubernetes volume is usually persistent data for a pod. That is, the data is tied to the life cycle of the pod rather than to any single container in the pod. The volume therefore outlives every container in the pod, and the data persists when a container is restarted. When the pod is deleted, the data may be kept or removed, depending on the volume implementation. If the data is kept, other pods can mount the volume and reuse it.
Volumes are divided into shared and non-shared. Non-shared volumes, such as iSCSI and Amazon* Elastic Block Store (Amazon EBS*), can be mounted by only a single node. In comparison, shared volumes can be used simultaneously by multiple pods on different nodes; examples are NFS, GlusterFS, CephFS and other network file systems, and block devices that support reads and writes from multiple nodes. Shared volumes can easily support container migration across multiple nodes in the same cluster.
Persistent Volume (PV) and Persistent Volume Claim (PVC) are two other important concepts besides volume in Kubernetes storage. A Kubernetes volume can be used to connect an external, pre-created volume to a pod. During the connection, volume parameters such as size and IOPS cannot be configured for the pod, because they are preset by the external storage that provides the volume. To provide more granular volume management for containers, Kubernetes adds the PV, which treats external storage as a resource pool available to the entire Kubernetes cluster. Each PV has attributes like those of ordinary storage, such as volume capacity, read/write access modes, and so on. When a pod needs to store data, it requests the storage resource from Kubernetes through a PVC. The PVC also specifies attributes such as capacity and read/write access modes. Kubernetes matches an appropriate resource to the request, assigns it to the pod, and mounts the volume on the node where the pod is running, for the use of the pod.
Unlike an ordinary volume, a PV is a resource object in Kubernetes with its own life cycle, independent of the pod. In Kubernetes, the PV controller implements and manages the life cycle of PVs and PVCs. Creating a PV is equivalent to creating a storage resource object in Kubernetes. A pod must request a PV through a PVC. The life cycle of PV and PVC is divided into five phases:
- Provisioning: a PV is either pre-created by an administrator (static mode) or created on demand from a StorageClass provided by the administrator (dynamic mode). The PVC names the StorageClass, which reflects the user's storage requirements.
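As an illustration of how these objects fit together, the following sketch builds a PV, a PVC, and a pod that uses the claim, and prints them as JSON, which kubectl accepts as well as YAML. The helper functions, object names, NFS server address, export path, and container image are all hypothetical; only the manifest field names come from the Kubernetes API.

```python
import json


def make_pv(name, size, access_modes, nfs_server, nfs_path):
    """PersistentVolume backed by an NFS export (values are placeholders)."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": list(access_modes),
            "persistentVolumeReclaimPolicy": "Retain",
            "nfs": {"server": nfs_server, "path": nfs_path},
        },
    }


def make_pvc(name, size, access_modes):
    """PersistentVolumeClaim: the pod-side request for capacity and access mode."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": list(access_modes),
            "resources": {"requests": {"storage": size}},
        },
    }


def make_pod(name, image, claim_name, mount_path):
    """Pod that mounts the claim; it never refers to the PV directly."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "app",
                "image": image,
                "volumeMounts": [{"name": "data", "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "data",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }


if __name__ == "__main__":
    manifests = [
        make_pv("pv-demo", "10Gi", ["ReadWriteMany"], "nfs.example.com", "/exports/demo"),
        make_pvc("pvc-demo", "5Gi", ["ReadWriteMany"]),
        make_pod("pod-demo", "nginx", "pvc-demo", "/data"),
    ]
    for manifest in manifests:
        print(json.dumps(manifest, indent=2))
```

Note that the pod references only the claim name. Which PV ends up behind the claim is decided by Kubernetes, which is what decouples the application from the storage backend.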
- Binding: the PV is bound and assigned to a PVC.
- Using: the pod uses the volume with the PVC.
- Releasing: the pod releases the volume and removes the PVC.
- Reclaiming: Kubernetes recycles the PV, or retains the PV for next use, or removes the PV from cloud storage.
Typically a user creates a PVC based on the capacity and the access mode it requires. The Kubernetes master node watches for newly created PVCs, looks for a matching PV, and binds the two together. Once the PV is bound to the PVC, the relationship between them is exclusive: the binding of PVC and PV is a one-to-one mapping. If no PV satisfies the requirements of the PVC, the PVC stays unbound until a matching PV appears, at which point it binds to that PV.
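The matching step can be pictured with a small sketch. It is a simplification of what the PV controller does, under stated assumptions: capacities are plain integers in GiB, only capacity and access modes are compared, and the smallest PV that fits is chosen. The PV and PVC classes and the bind function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PV:
    name: str
    capacity_gib: int
    access_modes: frozenset
    claim: Optional[str] = None    # name of the bound PVC, if any


@dataclass
class PVC:
    name: str
    request_gib: int
    access_modes: frozenset
    volume: Optional[str] = None   # name of the bound PV, if any


def bind(pvc, pvs):
    """Bind pvc to the smallest free PV that satisfies it, or return None."""
    candidates = [
        pv for pv in pvs
        if pv.claim is None                        # one-to-one: skip bound PVs
        and pv.capacity_gib >= pvc.request_gib     # enough capacity
        and pvc.access_modes <= pv.access_modes    # requested modes supported
    ]
    if not candidates:
        return None                                # the PVC stays unbound
    best = min(candidates, key=lambda pv: pv.capacity_gib)
    best.claim = pvc.name
    pvc.volume = best.name
    return best
```

A claim for 5 GiB may therefore end up bound to a 10 GiB PV if that is the smallest one available, and no other claim can use the remaining capacity.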
The pod uses the PVC as a volume, and the Kubernetes cluster looks up the bound PV through the PVC and mounts it to the pod. For the volume that supports multiple access modes, the user can specify its desired access mode when using the PVC as the volume, and the PV is exclusively owned by the user.
When the user has finished with the volume, the PVC can be deleted, and the storage can be requested again in the future. After the PVC is deleted, the corresponding persistent storage volume is released, but it cannot be used by other PVCs yet, because the previous data is still saved on the volume. Three policies can be adopted to reclaim the PV: retain, recycle, and delete.
The retain policy keeps the PV and its data so that the volume can be reclaimed manually and requested again. The delete policy deletes both the volume and the data on it: the PV is deleted from the Kubernetes cluster and the backing external storage is also deleted, where the deletion itself is defined by the third-party vendor of the external storage. The recycle policy, which is now deprecated, performs a basic scrub of the volume (rm -rf) and then makes the volume available for a new claim.
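The three policies can be compared side by side in a short sketch. It is a toy model, not the PV controller: the PV is a plain dictionary, the external storage is a dictionary keyed by PV name, and the function name reclaim is hypothetical. The phase names Released and Available follow the PV phases reported by Kubernetes.

```python
def reclaim(pv, backend):
    """Apply a PV's reclaim policy after its PVC has been deleted.

    pv      -- dict with "name", "reclaim_policy" and "phase" keys
    backend -- dict mapping PV name to its stored data (external storage)
    Returns the PV, or None if the PV was removed from the cluster.
    """
    policy = pv["reclaim_policy"]
    if policy == "Retain":
        # Data and PV are kept; an administrator must reclaim it manually.
        pv["phase"] = "Released"
        return pv
    if policy == "Delete":
        # Both the PV object and the backing storage go away.
        backend.pop(pv["name"], None)
        return None
    if policy == "Recycle":
        # Deprecated: scrub the data, then offer the PV to new claims.
        backend[pv["name"]] = {}
        pv["phase"] = "Available"
        return pv
    raise ValueError(f"unknown reclaim policy: {policy}")
```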
Figure 3. Relationship between Volume, PV, PVC, and StorageClass
In summary, the relationship between the four important concepts (volume, PV, PVC, and StorageClass) in Kubernetes storage is shown in Figure 3. Volume is the most basic storage abstraction. It belongs to the storage infrastructure and supports local storage, NFS, and numerous cloud storage services. Developers can also write their own storage plugins to support their specific storage systems, as long as the plugins conform to the Kubernetes specification for volume plugins. A volume can be used directly by a pod or by a PV. By providing different StorageClasses, Kubernetes cluster administrators can meet different storage requirements for quality level, backup policy, and so on, and Kubernetes can then automatically create PVs that meet the user’s needs.
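Dynamic provisioning through a StorageClass might look like the sketch below, which builds the two manifests involved and prints them as JSON. The provisioner name example.com/nvmeof, the parameter keys, and the helper function names are placeholders chosen for illustration and do not refer to a real driver; the manifest field names come from the Kubernetes API.

```python
import json


def make_storage_class(name, provisioner, parameters, reclaim_policy="Delete"):
    """StorageClass: the administrator's description of one class of storage."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,      # which driver creates the volumes
        "parameters": dict(parameters),  # opaque, driver-specific settings
        "reclaimPolicy": reclaim_policy,
    }


def make_dynamic_pvc(name, size, storage_class):
    """PVC that asks for a volume of the given class to be created on demand."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    sc = make_storage_class("fast", "example.com/nvmeof", {"tier": "gold"})
    pvc = make_dynamic_pvc("fast-claim", "20Gi", "fast")
    print(json.dumps([sc, pvc], indent=2))
```

With a class like this in place, nobody has to pre-create PVs: the claim names the class, and the provisioner behind the class creates a PV that fits.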
NVME-OF SUPPORT FOR KUBERNETES
Kubernetes supports storage vendors in a plugin manner. Storage providers supply their own storage drivers by implementing the Kubernetes volume plugin specification. This approach gives Kubernetes a rich list of supported storage systems. In terms of implementation, however, such a volume plugin is built as an in-tree plugin, meaning the plugin is compiled, linked, built, and distributed with Kubernetes. Adding support for a new storage system therefore requires checking the storage driver code into the main Kubernetes repository, which ties every driver to the Kubernetes release process and is hard to justify.
The Container Storage Interface (CSI) was developed to solve this problem. It allows drivers to be developed outside of the main Kubernetes repository while preserving the current storage architecture of Kubernetes: a CSI driver simply plugs into Kubernetes to provide storage services. CSI uses gRPC calls over sockets, so a driver can run independently in its own isolated container. As a result, Kubernetes and the storage providers are completely decoupled, and each storage component can run as a container on Kubernetes instead of running as a Kubernetes component on the host node.
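The shape of a CSI driver can be sketched without any gRPC machinery. The toy class below keeps its state in memory, and its methods are named after four of the RPCs in the CSI specification (CreateVolume, NodePublishVolume, NodeUnpublishVolume, DeleteVolume). It assumes simplified arguments in place of the real request messages, and the class name ToyCSIDriver and the volume ID format are hypothetical.

```python
class ToyCSIDriver:
    """In-memory stand-in for a CSI driver; real drivers serve gRPC on a socket."""

    def __init__(self):
        self.volumes = {}       # volume_id -> capacity in bytes
        self.published = set()  # (volume_id, target_path) pairs
        self.calls = []         # order in which the RPCs were invoked

    def create_volume(self, name, capacity_bytes):
        # CreateVolume: provision storage on the backend.
        # Repeating the call with the same name returns the same volume.
        self.calls.append("CreateVolume")
        volume_id = f"vol-{name}"
        self.volumes.setdefault(volume_id, capacity_bytes)
        return volume_id

    def node_publish_volume(self, volume_id, target_path):
        # NodePublishVolume: make the volume available at a path on the node.
        self.calls.append("NodePublishVolume")
        if volume_id not in self.volumes:
            raise KeyError(volume_id)
        self.published.add((volume_id, target_path))

    def node_unpublish_volume(self, volume_id, target_path):
        # NodeUnpublishVolume: undo the publish step.
        self.calls.append("NodeUnpublishVolume")
        self.published.discard((volume_id, target_path))

    def delete_volume(self, volume_id):
        # DeleteVolume: release the storage on the backend.
        self.calls.append("DeleteVolume")
        if any(vid == volume_id for vid, _ in self.published):
            raise RuntimeError("volume is still published")
        self.volumes.pop(volume_id, None)
```

Because Kubernetes only issues these calls and never links the driver code, a storage vendor can ship and update the driver on its own schedule.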
With the development of NVMe SSDs, the local computing capability of a single server can no longer exploit the full performance of its SSDs, so compute is becoming the performance bottleneck. To solve this problem, storage is separated from compute and the SSDs are placed in a storage cluster that many remote nodes can access. The NVMe committee developed the NVMe over Fabrics (NVMe-oF) specification to enable communication between a host and storage over a network, and to address the bandwidth and latency challenges of transmitting data between a local node and remote storage. About 90% of the NVMe-oF protocol is the same as the local NVMe protocol. NVMe-oF mainly extends the transport section of the local NVMe protocol to support various networking fabrics, such as Fibre Channel and RDMA, the latter provided by InfiniBand*, iWARP*, and RoCE* capable NICs. Because the NVMe interface is very efficient and lightweight, remote access adds little overhead and the bottleneck is removed.
Samsung* tested the performance of NVMe-oF as a Kubernetes PV plugin and compared it with iSCSI on the Samsung all-flash array reference design. The test measured IOPS for random reads (4KB) and bandwidth for sequential reads (128KB). It is unclear whether write operations were tested, but Samsung's presentation at Flash Memory Summit 2016 reported that PVs based on NVMe-oF deliver native NVMe performance to containers.
Figure 4. NVMe-oF CSI
Intel has also proposed a design to support NVMe-oF as a plugin for Kubernetes storage, shown in Figure 4. Here, the components on the Kubernetes master are responsible for creating or deleting the NVMe-oF storage volume when the user sends a request via kubectl. The master calls the storage provisioner with RESTful APIs through the NVMe-oF control service, and the provisioner then uses the standard Swordfish APIs to call a component on the storage node, the Pooled Storage Management Engine (PSME), to compose or decompose NVMe resources.
Different from the pod concept in Kubernetes, Intel® Rack Scale Design (Intel® RSD) Pod Manager (PODM) in Figure 4 refers to the manager of a physical collection of multiple racks. It is a coincidence that both Kubernetes and Intel RSD have given the same name to two different concepts.
Once the resource is allocated, service containers on the Kubernetes node use nvme-cli to configure the storage volume and attach it to the pod through the NVMe-oF initiator. After that, the pod reads data from and writes data to the remote NVMe disks through the NVMe-oF initiator as well, which forms the data plane of the storage system. Several network protocols can connect Kubernetes nodes to storage nodes today, but RDMA is the most widely adopted.
When the pod no longer needs the volume, the service containers detach the volume from the pod, and the NVMe resource is deleted along the same path as volume creation. The feature and its third-party CI/CD are still under development for Kubernetes upstream, but the proof of concept has been tested internally.
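The attach step on the node can be illustrated by the commands an initiator-side service might run. The sketch below only builds argument lists for the nvme-cli connect and disconnect subcommands and does not execute them. The source does not say how the Intel RSD driver invokes the tool, so this is an assumption, and the NQN and target address in the usage lines are placeholders.

```python
def nvme_connect_argv(nqn, traddr, trsvcid="4420", transport="rdma"):
    """Build the nvme-cli command that attaches a remote NVMe-oF subsystem.

    nqn       -- NVMe Qualified Name of the target subsystem
    traddr    -- transport address of the storage node
    trsvcid   -- transport service ID (port); 4420 is the registered NVMe-oF port
    transport -- fabric type, RDMA by default
    """
    return [
        "nvme", "connect",
        "--transport", transport,
        "--traddr", traddr,
        "--trsvcid", str(trsvcid),
        "--nqn", nqn,
    ]


def nvme_disconnect_argv(nqn):
    """Build the nvme-cli command that detaches the subsystem again."""
    return ["nvme", "disconnect", "--nqn", nqn]


if __name__ == "__main__":
    # Placeholder target; running the commands needs root and a real fabric.
    nqn = "nqn.2019-01.com.example:volume1"
    print(" ".join(nvme_connect_argv(nqn, "192.0.2.10")))
    print(" ".join(nvme_disconnect_argv(nqn)))
```

After a successful connect, the remote namespace shows up on the node as a local NVMe block device, which is what allows it to be mounted into the pod like any other disk.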
This article began by introducing container storage architecture and explained that containers used to be stateless, so ephemeral storage was sufficient. As container technology developed, containers came to require persistent storage. The article described the volume plugin implementation in Docker to support persistent storage, volume management in Kubernetes, and the CSI that lets Kubernetes volume plugins support multiple storage drivers. Finally, it introduced a specific CSI driver for NVMe-oF in Kubernetes. Readers may download the code for the CSI driver for Intel® RSD at github.com and try it out.
In the future, we look forward to more people contributing to the ecosystem around NVMe-oF technology and integrating it into the Kubernetes community, with more attributes added to the NVMe-oF volume specification.
- Manage data in Docker: https://docs.docker.com/storage/
- Storage concept in Kubernetes: https://kubernetes.io/docs/concepts/storage/
- The Next Generation of Cloud Storage Fabric: NVMe-oF* article: https://01.org/blogs/qiaoweir/2019/next-generation-cloud-storage-fabric-nvme
- NVMe-oF Storage Volumes for Containers: https://www.samsung.com/us/labs/pdfs/2016-fms-203-c-nvmeof-storage-volumes-for-containers.pdf
- Intel® Rack Scale Design Pod Manager (PODM): https://www.thailand.intel.com/content/dam/www/public/us/en/documents/guides/pod-manager-api-spec-v2-2.pdf
- Pooled Storage Management Engine (PSME): https://github.com/intel/intelRSD/tree/master/PSME
- Container Storage Interface (CSI) Driver for Intel® Rack Scale Design (Intel® RSD) NVMeoF: https://github.com/intel/csi-intel-rsd
Shane Wang, Individual Director of OpenStack Foundation Board and Engineering Manager of Networking and Storage Team at Intel System Software Products