Kubernetes Storage Primitives for Stateful Workloads

This article is based on the presentation “Introduction to Kubernetes Storage Primitives for Stateful Workloads” given by the {Code} team at the OSS Convention Prague 2017. So let’s start: what is Kubernetes?

Kubernetes

Kubernetes is the Greek word for “Helmsman”; it is also the root of the word “Governor”.

What Kubernetes is/does:

  • Container orchestrator
  • Supports multiple container runtimes (including runC from Docker)
  • Supports cloud and bare-metal clusters
  • Inspired and informed by Google’s experience
  • Open source, written in Go

Kubernetes manages applications, not machines!

Separation of Concerns

You can separate an information system into 4 layers:

  • Application
  • Cluster (Kubernetes is here!)
  • Kernel/OS
  • Hardware

Ideally, each layer should be replaceable in a transparent way. Kubernetes embraces this philosophy by being heavily based on APIs.

Kubernetes Goals

  • Open API and implementation
  • Modular/replaceable
  • Don’t force apps to know about concepts that are:
    • Cloud Provider Specific
    • Kubernetes Specific
  • Enable Users To
    • Write once, run anywhere
    • Avoid vendor lock-in
    • Avoid coupling app to infrastructure

Now let’s dig into the “pod” concept in Kubernetes. It is roughly comparable to a “task” in Docker Swarm, except that a pod can group several containers.

Pods

A pod is the atomic unit of deployment. It is composed of a small set of tightly coupled containers and volumes.

Some of its main properties are:

  • A shared namespace
    • containers share an IP address & localhost
    • they also share IPC, etc.
  • A managed lifecycle
    • a pod is bound to a node and restarts in place
    • a pod can die and will not be reborn with the same ID

Example:

# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: filepuller
    image: saadali/filepuller:v2
  - name: webserver
    image: saadali/webserver:v3

File modifications in a container are bound to the container instance only; a container’s termination or crash therefore results in loss of data. This is particularly problematic for stateful applications, or when containers need to share files.

The Kubernetes Volume abstraction solves both of these problems.

Kubernetes Volumes

Kubernetes Volumes differ from Docker Volumes. In Docker, a volume is simply a directory on disk or in another container. Lifetimes are not managed and until very recently there were only local-disk-backed volumes. A Kubernetes volume, on the other hand, has an explicit lifetime.

A Kubernetes volume is:

  • A directory, possibly with some data in it
  • Accessible by all containers in a pod

Volume plugins define:

  • How the directory is set up
  • Medium that backs it
  • Contents of the directory

A volume’s lifetime is the same as the pod’s, or longer.
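
For instance, an emptyDir volume lets the containers in a pod share files; it lives as long as the pod and survives individual container restarts. A minimal sketch, reusing the images from the pod example above:

# shared-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  volumes:
    # ephemeral volume: created empty when the pod starts, deleted with the pod
    - name: shared-data
      emptyDir: {}
  containers:
    - name: filepuller
      image: saadali/filepuller:v2
      volumeMounts:
        - name: shared-data
          mountPath: /data
    - name: webserver
      image: saadali/webserver:v3
      volumeMounts:
        - name: shared-data
          mountPath: /data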

More importantly, Kubernetes supports many types of volumes.

Kubernetes Volume plugins

Kubernetes has many volume plugins:

  • Remote Storage:
    • GCE Persistent disk
    • AWS
    • Azure (FS & Data Disk)
    • Dell EMC ScaleIO
    • iSCSI
    • Flocker
    • NFS
    • vSphere
    • GlusterFS
    • Ceph File and RBD
    • Cinder
    • Quobyte Volume
    • FibreChannel
    • VMware Photon PD
  • Ephemeral Storage
    • Empty dir (tmpfs)
    • Expose Kubernetes API (see the Secret example after this list)
      • Secret
      • ConfigMap
      • DownwardAPI
  • Local Storage (Alpha)
    • Containers exposing software-based storage
  • Out-of-Tree
    • Flex (execs a binary, allowing the use of external drivers)
    • CSI (Container Storage Interface, a generic API specification for container storage access; support will come in a future release)
  • Other:
    • Host path

Since Kubernetes is open, third-party storage may be made available through out-of-tree plugins.

To ensure interoperability between cluster orchestrators, CloudFoundry, Mesos, and Kubernetes are working on a standard out-of-tree API for “universal” container storage: the CSI.

GCE PD Example

A volume can be referenced directly in the pod definition, for example:

# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: sleepypod
spec:
  volumes:
    - name: data
      gcePersistentDisk:
        pdName: panda-disk
        fsType: ext4
  containers:
    - name: sleepycontainer
      image: ...
      command:
        - sleep
        - "6000"
      volumeMounts:
        - name: data
          mountPath: /data
          readOnly: false

However, directly referencing a volume is “like tattooing the name of your girlfriend on your arm when you’re 16”: it may look like a good idea because you think it will last forever, but it generally doesn’t.

Persistent Volumes & Claims (PV & PVC)

So the main principle is to separate the persistent volume declaration from the pod.

First, we declare the persistent volumes as their own resources. Then we bind a pod to an available volume through a persistent volume claim.

PV Example

Let’s create persistent volumes pv1 with 10GiB and pv2 with 100GiB. Here is the pv2 definition as an example:

# pv2.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv2
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  persistentVolumeReclaimPolicy: Retain
  gcePersistentDisk:
    fsType: ext4
    pdName: panda-disk2
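
For completeness, pv1 follows the same pattern with a 10Gi capacity and its own underlying disk; a sketch, where the disk name panda-disk1 is an assumption:

# pv1.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv1
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  persistentVolumeReclaimPolicy: Retain
  gcePersistentDisk:
    fsType: ext4
    # assumed GCE disk name, following the panda-disk naming used above
    pdName: panda-disk1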

And here is how we create them:

$ kubectl create -f pv1.yaml
persistentvolume "pv1" created
$ kubectl create -f pv2.yaml
persistentvolume "pv2" created
$ kubectl get pv
NAME          CAPACITY   ACCESSMODES   STATUS      CLAIM                        REASON    AGE
pv1           10Gi       RWO           Available                                          1m
pv2           100Gi      RWO           Available                                          1m

PVC Example

Now that we have unused persistent volumes, we can claim one through a PVC:

# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc
  namespace: testns
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

When a claim is created, Kubernetes binds it to an available persistent volume that satisfies the request; here, mypvc asks for 100Gi, so it is bound to pv2:

$ kubectl create -f pvc.yaml
persistentvolumeclaim "mypvc" created
$ kubectl get pv
NAME          CAPACITY   ACCESSMODES   STATUS      CLAIM                        REASON    AGE
pv1           10Gi       RWO           Available                                          3m
pv2           100Gi      RWO           Bound       testns/mypvc                           3m

You can then reference the PVC directly in the pod declaration:

# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: sleepypod
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: mypvc
  containers:
    - name: sleepycontainer
      image: gcr.io/google_containers/busybox
      command:
        - sleep
        - "6000"
      volumeMounts:
        - name: data
          mountPath: /data
          readOnly: false

Dynamic Provisioning and Storage Classes

  • Allows storage to be created on-demand (when requested by a user).
  • Eliminates the need for cluster administrators to pre-provision storage.
  • Cluster/storage admins “enable” dynamic provisioning by creating StorageClass objects.
  • A StorageClass defines the parameters used during creation.
  • StorageClass parameters are opaque to Kubernetes, so storage providers can expose any number of custom parameters for the cluster admin to use.

Here’s how you declare a StorageClass:

# sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: slow
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd

  • Users consume storage the same way, with a PVC
  • “Selecting” a storage class in the PVC triggers dynamic provisioning

Here’s how to create a PVC with StorageClass:

# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc
  namespace: testns
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast

$ kubectl create -f sc.yaml
storageclass "slow" created
storageclass "fast" created
$ kubectl create -f pvc.yaml
persistentvolumeclaim "mypvc" created
$ kubectl get pvc --all-namespaces
NAMESPACE   NAME                       STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
testns      mypvc                      Bound     pvc-331d7407-fe18-11e6-b7cd-42010a8000cd   100Gi      RWO           6s
$ kubectl get pv pvc-331d7407-fe18-11e6-b7cd-42010a8000cd
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM          REASON    AGE
pvc-331d7407-fe18-11e6-b7cd-42010a8000cd   100Gi      RWO           Delete          Bound     testns/mypvc             13m

And then the user references the volume via the PVC:

# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: sleepypod
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: mypvc
  containers:
    - name: sleepycontainer
      image: gcr.io/google_containers/busybox
      command:
        - sleep
        - "6000"
      volumeMounts:
        - name: data
          mountPath: /data
          readOnly: false

Default Storage Class

A default StorageClass allows dynamic provisioning even when no StorageClass is specified in the PVC.

Pre-installed Default Storage Classes:

  • Amazon AWS - EBS volume
  • Google Cloud (GCE/GKE) - GCE PD
  • OpenStack - Cinder Volume

The Default Storage Class feature was introduced as alpha in Kubernetes 1.2 and is GA as of 1.6.
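
An administrator marks a StorageClass as the default via an annotation. A minimal sketch based on the 1.6-era beta annotation (the class name and provisioner are chosen for illustration):

# default-sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
  annotations:
    # PVCs that do not specify a storageClassName will use this class
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard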

What’s Next for Kubernetes Storage?

Kubernetes Storage is investing in:

  • Container Storage Interface (CSI)
    • Standardized Out-of-Tree File and Block Volume Plugins
  • Local Storage
    • Making node local storage available as persistent volume
  • Capacity Isolation
    • Setting up limits so that a single pod can’t consume all available node storage via overlay FS, logs, etc.

Impressions

Kubernetes provides, through APIs and plugins/drivers, a clean, agnostic, standardized way to declare and use volumes for your containers. With this feature, you can migrate your storage backend cleanly and easily. Convergence and standardization of these plugins with solutions like CloudFoundry, Docker Swarm, and Mesos appear to be in progress, but little information is available so far.
