Introduction to Kubernetes* Networking and Acceleration with DPDK
Kubernetes is a well-known portable, extensible, open-source platform for managing containerized workloads and services, which facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available. This article focuses on networking in Kubernetes, briefly introduces Kubernetes networking communication, and mentions two options for acceleration using the Data Plane Development Kit (DPDK).
Networking communication inside Kubernetes can be divided into four parts, shown in Figure 1:
- Container-to-container communication within a pod
- Pod-to-pod communication
- Communication between pod and service
- Communication between service and an application outside of the Kubernetes environment
Figure 1. Kubernetes Networking Model
Container-to-Container Communication within Pod
Kubernetes assigns an IP address to each pod, and all containers in the same pod share that pod's network namespace, including its IP address and network ports. This means they can reach each other through localhost and the container's port. This networking model is called IP-per-Pod. It is important to note that a network namespace is a collection of network interfaces and routing tables, i.e., connections between two pieces of equipment on a network and instructions for where to send packets, respectively.
The implementation of this model relies on a hidden container called the pause container. When creating a pod, Kubernetes first creates the pause container on the node, acquires the pod's IP address, and sets up the network namespace that all other containers in the pod will join. The other containers, called application containers, simply join that network namespace with --net=container:<id> when they are created. After that, they all run in the same network namespace.
With the IP-per-Pod networking model, the containers in a pod can reach one another over localhost. They must coordinate their use of ports so that no two containers in the same pod bind to the same port, but they do not need to worry about port collisions with containers in other pods.
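Because the containers share one network namespace, binding the same port twice fails just as it would for two processes on a single host. The following minimal Python sketch uses plain sockets as stand-ins for two containers in one pod:

```python
import socket

# "Container A" grabs a free port in the shared namespace.
a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a.bind(("127.0.0.1", 0))
port = a.getsockname()[1]

# "Container B" in the same namespace tries to bind the same port.
b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    b.bind(("127.0.0.1", port))
    collision = False
except OSError:                  # EADDRINUSE: ports must be coordinated
    collision = True
b.close()
a.close()

print("port collision detected:", collision)
```

The same bind on a different pod's namespace would succeed, which is why only intra-pod port assignments need coordination.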
Two pods may run on the same node or on different nodes, so pod-to-pod communication falls into two categories: communication between pods on the same node and communication between pods on different nodes.
For communication between pods on the same node, each pod sees a normal Ethernet device eth0 with a real IP address. In reality, Kubernetes connects each pod through a virtual Ethernet pair: eth0 on the pod's side and vethX on the node's side. The virtual device acts as a tunnel between the pod's network and the node. A network bridge on the node then connects these veth devices, so every pod on the node is attached to the same bridge. Using this mechanism, which is transparent to users, different pods on the same node can communicate directly by IP address, without other discovery mechanisms such as DNS, Consul, or etcd.
Figure 2. Inter-Pod Communication on the same node
As shown in Figure 2, Pod 1 and Pod 2 are connected to the same bridge through veth. Their IP addresses (IP1 and IP2) are dynamically obtained from the network segment of the bridge. In addition, on the Linux* protocol stack of Pod 1 and Pod 2, the default route is the address of the bridge. This means all network traffic to non-local addresses is sent to the bridge by default, and directly transferred by the bridge. Since both Pod 1 and Pod 2 are connected to the same bridge and are in the same network segment, they can communicate directly.
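The routing decision described above can be sketched in Python. The bridge segment and gateway address below are assumptions for illustration, not values Kubernetes mandates:

```python
import ipaddress

BRIDGE_CIDR = ipaddress.ip_network("10.244.1.0/24")  # assumed bridge segment
BRIDGE_IP = ipaddress.ip_address("10.244.1.1")       # pods' default gateway

def next_hop(dst):
    """Mimic a pod's routing decision on the node's Linux stack."""
    dst = ipaddress.ip_address(dst)
    if dst in BRIDGE_CIDR:
        return str(dst)      # same segment: deliver directly over the bridge
    return str(BRIDGE_IP)    # otherwise: default route to the bridge address

print(next_hop("10.244.1.5"))   # another pod on the same node
print(next_hop("10.244.2.7"))   # pod on a different node
```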
For communication between pods on different nodes, the address of a pod is in the same network segment as its bridge, which serves as the default gateway. However, the bridge's network segment and the node's NIC are on completely different IP network segments, and communication between nodes can only go through the node's physical NIC.
The dynamically allocated private IP addresses behind the bridge are not invisible to the rest of the cluster, however. Kubernetes records the IP allocation of all running pods and saves it in etcd as service endpoints, because those private addresses are required for communication between pods. Planning those pod IP addresses is also important: address collisions must be avoided across the entire Kubernetes cluster.
In a nutshell, two conditions must be met to support inter-pod communication on different nodes:
- Plan and allocate the IP addresses for the pods in the entire Kubernetes cluster without any collision.
- In order to communicate between the pods, the IP address of the pod must be associated with the IP address of the node.
For the first condition, the bridge's IP segment must be planned at deployment time so that it does not collide with the bridge segments on other nodes. For the second condition, a mechanism is needed that knows which node hosts the target pod's IP address when the source pod prepares to send data. That is, the IP address of the target node must be found first, so the data can be sent to that node's NIC and transferred to the bridge on that node. Once the data reaches the target node, the bridge inside that node knows how to deliver it to the target pod.
Figure 3. Inter-Pod Communication on different nodes
As shown in Figure 3, IP 1 corresponds to Pod 1 and IP 2 corresponds to Pod 2. When Pod 1 accesses Pod 2, packets leave through eth0 of the source pod (Pod 1) and reach eth0 of the target node (Node 2). That is, in most cases, packets are sent from Node 1 to Node 2 by a tunnel protocol such as VXLAN, and then delivered to Pod 2.
Therefore, addressing and communication between pods on different nodes must be implemented using the IP addresses of the nodes, together with a cluster-level table that maps IP address ranges to the various nodes. In a real environment, deploying Kubernetes and Docker* alone is not enough; additional network configuration, and in some cases additional facilities or plugins, is required so that pods can communicate with each other transparently.
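To make the two conditions concrete, here is a Python sketch that carves non-overlapping per-node pod subnets out of an assumed cluster CIDR and maps a destination pod IP to the node that hosts it. The node names, CIDRs, and node IPs are all hypothetical:

```python
import ipaddress

# Assumed cluster-wide pod CIDR, split into non-overlapping per-node /24s
# (condition 1: collision-free IP planning).
cluster_cidr = ipaddress.ip_network("10.244.0.0/16")
nodes = ["node-1", "node-2", "node-3"]          # hypothetical node names
subnets = cluster_cidr.subnets(new_prefix=24)
node_cidrs = {node: next(subnets) for node in nodes}

# Hypothetical node NIC addresses (condition 2: pod IP -> node IP mapping).
node_ips = {"node-1": "192.168.0.11",
            "node-2": "192.168.0.12",
            "node-3": "192.168.0.13"}

def node_for_pod(pod_ip):
    """Cluster-level lookup: which node's NIC should receive this packet?"""
    ip = ipaddress.ip_address(pod_ip)
    for node, cidr in node_cidrs.items():
        if ip in cidr:
            return node_ips[node]
    raise LookupError(f"{pod_ip} is outside the cluster pod CIDR")

print(node_for_pod("10.244.1.9"))   # lives in node-2's /24
```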
Communication between Pod and Service
Pods in Kubernetes are neither stable nor long lasting; they can be destroyed and recreated for various reasons. For instance, during vertical scaling or a rolling upgrade, an old pod is destroyed and replaced with a new pod, which changes the pod IP address. To prevent a situation where the frontend cannot reach a pod because its IP address changed, Kubernetes introduced the concept of a service.
Figure 4. Communication between Pod and Service
When a service is created, Kubernetes assigns it a virtual IP address which is fixed during the lifetime of the service. When accessing the functions provided by a container in a pod, the IP address and the port of the pod are not accessed directly, instead, the virtual IP address of the service and its port are used. The service forwards the request to the pod. For instance, in Figure 4, three pods exist behind the frontend service. Furthermore, Kubernetes also implements load balancing, service discovery and DNS, and so forth, through services.
When a service is created, Kubernetes looks for the pod with the label selector of the service, and creates an endpoint with the same name as the service. The target port of the service and the IP address of the pod are saved in that endpoint. After the IP address of the pod is changed, the endpoint is also changed accordingly. When the service receives a new request, it is able to find the target address to forward the request through the endpoint.
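A rough Python sketch of this bookkeeping follows; the field names and pod data are illustrative, not the real Kubernetes API objects:

```python
# Hypothetical pod records, as the control plane might track them.
pods = [
    {"name": "web-1", "labels": {"app": "web"}, "ip": "10.244.0.4"},
    {"name": "web-2", "labels": {"app": "web"}, "ip": "10.244.1.6"},
    {"name": "db-1",  "labels": {"app": "db"},  "ip": "10.244.1.7"},
]

def build_endpoint(selector, target_port):
    """Collect (pod IP, target port) pairs for pods matching the selector."""
    return [(p["ip"], target_port) for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

endpoint = build_endpoint({"app": "web"}, 8080)
print(endpoint)

# When a pod is recreated with a new IP, rebuilding the endpoint
# automatically yields the updated target address.
pods[0]["ip"] = "10.244.0.9"
print(build_endpoint({"app": "web"}, 8080))
```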
A service is an abstract entity, and its IP address is virtual. The kube-proxy on the node is responsible for forwarding requests.
In Kubernetes v1.0, a service is a structure at Layer 4, i.e. TCP/UDP over IP, in the Open System Interconnection (OSI) seven-layer network model, and kube-proxy runs purely in the user space.
In Kubernetes v1.1, the ingress API (beta) was added to represent Layer 7, i.e., HTTP services. The iptables proxy mode was also added, and it has been the default mode of kube-proxy since Kubernetes v1.2.
In Kubernetes v1.8.0-beta.0, the ipvs proxy mode was added. Therefore, kube-proxy currently has three request forwarding modes: userspace, iptables, and ipvs.
In the userspace mode, kube-proxy watches for additions and deletions of services and endpoints on the master, which is the main controlling unit of the cluster that manages and schedules pods onto the worker nodes. When a service is created, kube-proxy on the node opens a random port for the service, called the proxy port, and then installs an iptables rule. iptables then forwards traffic from <the virtual IP of the service, the port> to the proxy port, and kube-proxy selects a pod from the endpoint and forwards the traffic from the proxy port to that pod. When an endpoint contains multiple pods, there are two algorithms for selecting the pod: one is round-robin, trying the next pod if one pod doesn't respond; the other picks a pod that is closer to the source IP address of the request.
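The round-robin variant can be sketched as follows; the backend addresses and the `responsive` table are invented for illustration:

```python
# Userspace-mode round-robin selection: try the next pod if one doesn't respond.
backends = ["10.244.0.4:8080", "10.244.1.6:8080", "10.244.2.3:8080"]
responsive = {"10.244.0.4:8080": False,   # assume this pod is down
              "10.244.1.6:8080": True,
              "10.244.2.3:8080": True}

def pick_round_robin(backends, start=0):
    for i in range(len(backends)):
        pod = backends[(start + i) % len(backends)]
        if responsive[pod]:          # stand-in for "the pod responded"
            return pod
    raise RuntimeError("no responsive backend")

print(pick_round_robin(backends))    # skips the unresponsive first pod
```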
In the iptables mode, when a service is created, kube-proxy on the node installs two sets of iptables rules: one for the service, to redirect traffic from <the virtual IP of the service, the port> to the backend, and one for the endpoint, to select a pod, where the selection is random by default. In this mode kube-proxy no longer needs to shuttle every packet between user space and kernel space, so it runs faster and more reliably than the userspace mode. However, unlike in the userspace mode, if the initially selected pod doesn't respond, kube-proxy in the iptables mode does not automatically retry another pod, so using the iptables mode requires readiness probes. The kubelet uses readiness probes to learn when a container can start accepting traffic. When all the containers inside a pod are ready, the pod is considered ready to accept traffic; when a pod is not ready, it is removed from the service load balancer. Hence, readiness probes control which pods are used as the backends of a service.
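A minimal sketch of how readiness gates the iptables mode's random selection; the addresses and readiness states are assumptions:

```python
import random

# Endpoint entries with readiness state. Pods failing their readiness probe
# are removed from the load-balancing set, which is why the no-retry
# iptables mode depends on readiness probes.
endpoints = [
    {"ip": "10.244.0.4", "ready": True},
    {"ip": "10.244.1.6", "ready": False},  # failing its readiness probe
    {"ip": "10.244.2.3", "ready": True},
]

def pick_random_ready(endpoints):
    ready = [e["ip"] for e in endpoints if e["ready"]]
    if not ready:
        raise RuntimeError("service has no ready backends")
    return random.choice(ready)  # random selection, no retry on failure

chosen = pick_random_ready(endpoints)
print(chosen in {"10.244.0.4", "10.244.2.3"})
```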
In the ipvs mode, kube-proxy calls the netlink interface to create the corresponding ipvs rules, and periodically synchronizes them with the services and endpoints to ensure that the ipvs state matches the desired state. When a service is accessed, traffic is redirected to one of the backend pods.
Like iptables, ipvs is based on the netfilter hook function; the difference is that iptables uses sequential matching, while ipvs uses hash-based matching. When the number of rules is large, iptables matching takes significantly longer, whereas ipvs uses a hash table as its underlying data structure and matches in kernel space, so matching stays fast. This means ipvs can redirect traffic faster and performs better when synchronizing proxy rules. ipvs also provides a variety of load balancing algorithms, such as round-robin (rr), least connection (lc), destination hashing (dh), source hashing (sh), shortest expected delay (sed), and never queue (nq). It is worth noting that the ipvs mode requires the ipvs kernel module to be installed on the node. When kube-proxy starts in the ipvs mode, it verifies whether the module is present; if not, kube-proxy falls back to the iptables mode.
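The matching difference can be illustrated with a toy example: a list scan stands in for iptables' sequential rule matching and a dict stands in for ipvs' hash table. The VIPs and backends are fabricated:

```python
# 200 fabricated service rules: virtual IP:port -> backend pod IPs.
rules = [(f"10.96.0.{i}:80", [f"10.244.0.{i}"]) for i in range(1, 201)]

def match_sequential(vip):
    """iptables-style: scan rules in order, O(n) in the number of rules."""
    for rule_vip, backends in rules:
        if rule_vip == vip:
            return backends
    return None

# ipvs-style: hash table lookup, O(1) on average regardless of rule count.
table = dict(rules)

vip = "10.96.0.200:80"          # worst case for the sequential scan
print(match_sequential(vip), table.get(vip))
```

Both return the same backends; the cost difference only shows up as the rule count grows, which matches the behavior described above.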
Communication between a Service and the Outside of Kubernetes
Kubernetes offers four types of services:
- ClusterIP: provides services on the IP address inside a cluster. This type of service can only be accessed from within the cluster. This is the default type in Kubernetes.
- NodePort: provides external services through a static port (NodePort) on NodeIP. Clients outside the cluster can access the service at <NodeIP>:<NodePort>. When this mode is used, a ClusterIP is automatically created, and requests arriving at the NodePort are eventually routed to the ClusterIP.
- LoadBalancer: provides services out of the cluster by using a cloud service provider's load balancer. When this mode is used, NodePort and ClusterIP are automatically created, and the load balancer out of the cluster will eventually route the requests to NodePort and ClusterIP.
- ExternalName: maps a service to a resource outside the cluster, for instance, foo.bar.example.com. Using this mode requires kube-dns version 1.7 or higher.
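The NodePort chain described above (NodePort → ClusterIP → pod) can be sketched as follows. Every address and port here is an assumed example, and the backend-selection policy is elided:

```python
# Hypothetical NodePort service definition.
service = {
    "type": "NodePort",
    "cluster_ip": "10.96.0.10",
    "port": 80,
    "node_port": 30080,
    "endpoints": ["10.244.0.4:8080", "10.244.1.6:8080"],
}

def route(dst_ip, dst_port, node_ips):
    """Resolve an incoming (IP, port) destination to a backend pod."""
    # NodePort: any node IP plus the static node port reaches the service,
    # and the request is rewritten to the ClusterIP.
    if service["type"] in ("NodePort", "LoadBalancer") and \
       dst_ip in node_ips and dst_port == service["node_port"]:
        dst_ip, dst_port = service["cluster_ip"], service["port"]
    # ClusterIP: the virtual IP is finally resolved to a backend pod.
    if dst_ip == service["cluster_ip"] and dst_port == service["port"]:
        return service["endpoints"][0]  # selection policy elided
    return None

print(route("192.168.0.11", 30080, {"192.168.0.11", "192.168.0.12"}))
```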
Networking Communication Acceleration with DPDK in Kubernetes
DPDK is a set of data plane libraries and network interface controller drivers for fast packet processing, currently managed as an open-source project under the Linux Foundation*. DPDK provides a programming framework for x86, ARM*, and PowerPC* processors, and it enables faster development of high-speed data packet networking applications.
DPDK bypasses the heavy layers of the Linux kernel networking stack and talks directly to the network hardware. Its core components are libraries and user-space drivers called Poll Mode Drivers (PMDs). It also uses memory hugepages, so fewer memory pages are needed and the number of Translation Lookaside Buffer (TLB) misses is reduced significantly. Furthermore, DPDK requires or leverages other platform technologies to avoid unnecessary overhead and increase performance, including CPU pinning for CPU-intensive workloads, Non-Uniform Memory Access (NUMA) awareness, Data Direct I/O (DDIO), several new IA instructions, Enhanced Platform Awareness (EPA) features, and others.
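The hugepage benefit is simple arithmetic: covering the same buffer pool with 2 MiB pages instead of 4 KiB pages needs far fewer pages, and therefore far fewer TLB entries. A sketch, assuming a 1 GiB packet-buffer pool (the pool size is an illustrative choice, not a DPDK requirement):

```python
# Pages required to map an assumed 1 GiB buffer pool.
buffer_bytes = 1 << 30            # 1 GiB
page_4k = 4 * 1024                # standard Linux page
page_2m = 2 * 1024 * 1024         # common hugepage size

pages_4k = buffer_bytes // page_4k
pages_2m = buffer_bytes // page_2m
print(pages_4k, pages_2m)         # far fewer pages -> far fewer TLB misses
```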
To improve the performance of Kubernetes networking communication, Intel proposed several acceleration solutions for containers, such as a Single Root I/O Virtualization (SR-IOV) plugin and virtio-user, shown in Figure 5 below.
Figure 5. Networking Acceleration for Container with DPDK
virtio-user is one option for leveraging DPDK. It is the frontend of virtio and connects to vhost backends, such as DPDK vhost-user, for communication. In this example, the DPDK vhost-user port is provided by a software virtual switch such as Open vSwitch* (OVS*)-DPDK or Vector Packet Processing (VPP).
The SR-IOV plugin is another alternative, applying Single Root I/O Virtualization to Kubernetes and containers. When a container is running, a Virtual Function (VF), i.e., a virtual NIC, is made visible to the container by moving the VF into the container's network namespace. With that, and with the VF driven by DPDK in user space, the networking performance of the container improves significantly.
To support a more flexible, user-defined networking model, Kubernetes provides a network plug-in interface that conforms to the Container Network Interface (CNI) container network specification. In order to apply SR-IOV and virtio-user to a Kubernetes-based container environment, Intel provides the SR-IOV CNI plugin and the vhost-user CNI plugin for the above two network acceleration options.
To help readers understand how networking works in the Kubernetes environment, this article has introduced Kubernetes networking communication, including container-to-container communication within a pod, pod-to-pod communication, communication between pod and service, and communication between service and an application outside of Kubernetes. It also mentioned two options for acceleration using DPDK, specifically, the SR-IOV plugin and virtio-user. Networking acceleration in Kubernetes with DPDK and other hardware technologies is essential for services and workloads that depend heavily on networking communication, for instance, various Containerized Network Functions (CNFs) in NFV. We encourage readers to try it out by downloading the SR-IOV CNI plugin, attaching Kubernetes pods directly to an SR-IOV virtual function to obtain high-performance networking for CNFs, and providing feedback via github.com.
- Enabling New Features with Kubernetes for NFV White Paper:
- Service concept in Kubernetes networking: https://kubernetes.io/docs/concepts/services-networking/service/
- Data Plane Development Kit: https://www.dpdk.org/
- SR-IOV CNI plugin: https://github.com/intel/sriov-cni
Shane Wang, Individual Director of OpenStack Foundation Board and Engineering Manager of Networking and Storage Team at Intel System Software Products