At VMworld 2018, one of the sessions I presented on was running Kubernetes on vSphere, and specifically using vSAN for persistent storage. In that presentation (which you can find here), I used Hadoop as a specific example, primarily because there are a number of moving parts to Hadoop. For example, there are the concepts of a Namenode and a Datanode. Put simply, a Namenode provides the lookup for blocks, whereas Datanodes store the actual blocks of data. Namenodes can be configured in an HA pair with a standby Namenode, but this requires a lot more configuration and resources, and introduces additional components such as journal nodes and ZooKeeper to provide high availability. There is also the option of a secondary Namenode, but this does not provide high availability. Datanodes, on the other hand, have their own built-in replication. The presentation showed how we could use vSAN to provide additional resilience to a Namenode, but use less capacity and resources for a component like the Datanode which has its own built-in protection.
A number of people have asked me how they could go about setting up such a configuration for themselves. They were especially interested in how to consume different policies for the different parts of the application.
In this article I will take a very simple Namenode and Datanode configuration that uses persistent volumes with different policies on our vSAN datastore.
We will also show how, through the use of a StatefulSet, we can very easily scale the number of Datanodes (and their persistent volumes) from 1 to 3.
Helm Charts
To begin with, I tried to use the Helm chart to deploy Hadoop. You can find it here. While this was somewhat successful, I was unable to figure out a way to scale the deployment’s persistent storage. Each attempt to scale the StatefulSet successfully created new Pods, but all of the Pods tried to share the same persistent volume. And although Kubernetes provides a config option to change a persistent volume from ReadWriteOnce (single Pod access) to ReadWriteMany (multiple Pod access), vSAN does not support Pods sharing the same PV at the time of writing.
However, if you simply want to deploy a single Namenode and a single Datanode with persistent volumes, this stable Hadoop Helm chart will work just fine for you. It’s also quite possible that scaling with unique PVs per Pod is achievable with the Helm chart; I’m afraid my limited knowledge of Helm meant that, in order to create unique PVs for each Pod as I scaled my Datanodes, I had to look for an alternate method. This led me to a Hadoop Cluster on Kubernetes using flokkr docker images that was already available on GitHub.
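For completeness, deploying that chart is just a couple of commands (Helm v2 syntax, which was current at the time of writing; check the chart’s own README for the values it accepts):

helm repo update
helm install --name hadoop stable/hadoop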
Hadoop Cluster on Kubernetes
Let’s talk about the flokkr Hadoop cluster. In this solution, there are only two YAML files. The first is config.yaml, which passes the Hadoop configuration files (core-site.xml, yarn-site.xml, etc.) to our Hadoop deployment via a configMap (more on this shortly). The second holds details of the services and StatefulSets for the Namenode and Datanode. I will modify the StatefulSets so that the /data directory on the nodes is placed on persistent volumes rather than on a local filesystem within the container (which is not persistent).
Use of configMap
I’ll be honest – this was the first time I had used the configMap construct, but it looks pretty powerful. In the config.yaml, there are entries for the 5 configuration files required by Hadoop when the environment is bootstrapped – core-site.xml, hdfs-site.xml, the log4j.properties file, mapred-site.xml and finally yarn-site.xml. These 5 different pieces of data can be seen when we query the configMap.
root@srvr:~/hdfs# kubectl get configmaps
NAME         DATA   AGE
hadoopconf   5      88m
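For reference, the configMap defined in config.yaml has roughly the following shape (an abridged sketch – only the core-site.xml entry is reproduced here, and the full contents ship with the flokkr manifests):

apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoopconf
data:
  core-site.xml: |
    <configuration>
      <property><name>fs.defaultFS</name><value>hdfs://namenode-0:9000</value></property>
    </configuration>
  # ...plus hdfs-site.xml, log4j.properties, mapred-site.xml and yarn-site.xml entries

A kubectl describe configmaps hadoopconf will dump the full contents should you ever need to check what was actually loaded.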
When we look at the hdfs.yaml, we can see how this configMap is referenced, and how these configuration files are made available in a specific directory in the application’s containers when the application is launched. First, let’s look at the StatefulSet.spec.template.spec.containers.volumeMounts entry:
volumeMounts:
- name: config
  mountPath: "/opt/hadoop/etc/hadoop"
This is where the files referenced in the config.yaml entries are going to be placed when the container/application is launched. If we look at that volume in more detail in StatefulSet.spec.template.spec.volumes, we see the following:
volumes:
- name: config
  configMap:
    name: hadoopconf
So the configMap in config.yaml, which is named hadoopconf, will place these 5 configuration files in “/opt/hadoop/etc/hadoop” when the application launches. The application contains an init/bootstrap script which deploys Hadoop using the configuration in these files. A little bit convoluted, but sort of neat. You do not need to make any changes here. Instead, we want to change the /data folder to use persistent volumes. Let’s see how to do that next.
Persistent Volumes – changes to hdfs.yaml
Now there are a few changes needed to the hdfs.yaml file to get it to use persistent volumes. We will need to make some changes to the StatefulSets for both the Datanode and the Namenode. First, we need to add a new volume mount for the PV, which for Hadoop will of course be /data. This appears in StatefulSet.spec.template.spec.containers.volumeMounts, and the name must match the volumeClaimTemplate name we define below (dn-data for the Datanode, nn-data for the Namenode). The Datanode entry looks like the following:
volumeMounts:
- name: dn-data          # nn-data in the Namenode StatefulSet
  mountPath: "/data"
  readOnly: false
Next we will need to make some specifications around the volumes that we are going to use. For the PVs, we are not adding an entry under StatefulSet.spec.template.spec.volumes as we did for the configMap. Instead we will use StatefulSet.spec.volumeClaimTemplates. This is what that looks like: first we have the Datanode entry, and then the Namenode entry. The differences are the storage class annotations and of course the volume sizes. This is how we will consume different storage policies on vSAN for the Datanode and the Namenode.
volumeClaimTemplates:
- metadata:
    name: dn-data
    annotations:
      volume.beta.kubernetes.io/storage-class: hdfs-sc-dn
  spec:
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 200Gi
volumeClaimTemplates:
- metadata:
    name: nn-data
    annotations:
      volume.beta.kubernetes.io/storage-class: hdfs-sc-nn
  spec:
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 50Gi
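Because these are volumeClaimTemplates, the StatefulSet controller creates one PVC per Pod, named <template-name>-<pod-name>. With the names above, we end up with claims such as:

dn-data-datanode-0
dn-data-datanode-1
dn-data-datanode-2
nn-data-namenode-0

which is exactly what we will see in the kubectl get pvc output later on.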
Storage Classes
We saw in the hdfs.yaml that the Datanode and Namenode referenced two different storage classes. Let’s look at those next. First is the Namenode.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: hdfs-sc-nn
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
  storagePolicyName: gold
  datastore: vsanDatastore
As you can see, the Namenode storage class uses a gold policy on the vSAN datastore. If we compare it to the Datanode storage class below, you will see that this is the only difference (other than the name). The provisioner is the VMware vSphere Cloud Provider (VCP), which is currently included in Kubernetes distributions but will soon be decoupled, along with the other built-in drivers, as part of the CSI initiative.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: hdfs-sc-dn
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
  storagePolicyName: silver
  datastore: vsanDatastore
vSAN Policies – storagePolicyName
Now you may be wondering why we are using two different policies. This is the beauty of vSAN. As we shall see shortly, the Datanodes have their own built-in replication mechanism (3 copies of each block are stored by default). Thus, it is possible to deploy the Datanode volumes on vSAN without any underlying protection from vSAN simply by specifying a policy with no redundancy (silver, e.g. RAID-0). This is because if a Datanode fails, there are still two copies of the data blocks.
However, the Namenode has no such built-in replication or protection feature. Therefore we can protect its underlying persistent volume with an appropriate vSAN policy (e.g. RAID-1, RAID-5). In my example, the gold policy provides this extra protection for the Namenode volume.
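To put rough numbers on it – purely as an illustration, since gold and silver are whatever you define them to be in vCenter – suppose silver is FTT=0 (RAID-0, no redundancy) and gold is FTT=1 (RAID-1 mirroring). Three 200Gi Datanode volumes then consume roughly 600Gi of raw vSAN capacity, whereas mirroring them as well would double that to around 1200Gi on top of the 3x replication HDFS already does at the block level. The single 50Gi Namenode volume on the gold policy consumes about 100Gi raw – a small price for protecting the one component that cannot protect itself.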
Deployment of Hadoop
Now there are only a few steps to deploying the Hadoop application: (1) create the storage classes, (2) create the configMap in config.yaml, and (3) create the services and StatefulSets in hdfs.yaml. All of these can be done with kubectl create -f <file.yaml>.
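For example, assuming the two storage class manifests were saved as hdfs-sc-nn.yaml and hdfs-sc-dn.yaml (those filenames are just my choice; the other two files come with the flokkr manifests), the whole deployment looks like this:

kubectl create -f hdfs-sc-nn.yaml -f hdfs-sc-dn.yaml   # storage classes
kubectl create -f config.yaml                          # configMap holding the Hadoop config files
kubectl create -f hdfs.yaml                            # services and StatefulSets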
Post Deployment Checks
The following are a bunch of commands that can be used to validate the state of the constituent components of the application after deployment.
root@srvr:~/cormac-hadoop# kubectl get nodes
NAME                STATUS                     ROLES    AGE   VERSION
kubernetes-master   Ready,SchedulingDisabled   <none>   46h   v1.12.0
kubernetes-node1    Ready                      <none>   46h   v1.12.0
kubernetes-node2    Ready                      <none>   46h   v1.12.0
kubernetes-node3    Ready                      <none>   46h   v1.12.0

root@srvr:~/cormac-hadoop# kubectl get configmaps
NAME         DATA   AGE
hadoopconf   5      150m

root@srvr:~/cormac-hadoop# kubectl get sc
NAME         PROVISIONER                    AGE
hdfs-sc-dn   kubernetes.io/vsphere-volume   40m
hdfs-sc-nn   kubernetes.io/vsphere-volume   40m

root@srvr:~/cormac-hadoop# kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
datanode     ClusterIP   None         <none>        80/TCP      15m
kubernetes   ClusterIP   10.0.0.1     <none>        443/TCP     46h
namenode     ClusterIP   None         <none>        50070/TCP   15m

root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   1         1         16m
namenode   1         1         16m

root@srvr:~/cormac-hadoop# kubectl get pvc
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dn-data-datanode-0   Bound    pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     10m
nn-data-namenode-0   Bound    pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            hdfs-sc-nn     10m

root@srvr:~/cormac-hadoop# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS   REASON   AGE
pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-0   hdfs-sc-dn              10m
pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            Delete           Bound    default/nn-data-namenode-0   hdfs-sc-nn              10m

root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   1         1         21m
namenode   1         1         21m
Post deploy Hadoop check
Since this is Hadoop, we can very quickly use some Hadoop utilities to check the state of our Hadoop cluster running on Kubernetes. Here is the output of one such command, which generates a report on the HDFS filesystem and also reports on the Datanodes. Note the capacity at the beginning, as we will return to this after scale-out.
root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfsadmin -report
Configured Capacity: 210304475136 (195.86 GB)
Present Capacity: 210224738304 (195.79 GB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used: 28672 (28 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (1):

Name: 172.16.96.5:9866 (172.16.96.5)
Hostname: datanode-0.datanode.default.svc.cluster.local
Decommission Status : Normal
Configured Capacity: 210304475136 (195.86 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 62959616 (60.04 MB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.96%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Dec 05 13:17:29 GMT 2018
Last Block Report: Wed Dec 05 13:05:56 GMT 2018

root@srvr:~/cormac-hadoop#
Post deploy check on config files
We mentioned that the purpose of the configMap in config.yaml was to put in place a set of configuration files that can be used to bootstrap Hadoop. This is how to verify that this step has indeed occurred (should you need to troubleshoot at any point). First we open a bash shell to the Namenode, and then we navigate to the mount point highlighted in the hdfs.yaml to verify that the files exist, which indeed they do in this case.
root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /bin/bash
bash-4.3$ cd /opt/hadoop/etc/hadoop
bash-4.3$ ls
core-site.xml  hdfs-site.xml  log4j.properties  mapred-site.xml  yarn-site.xml
bash-4.3$ cat core-site.xml
<configuration>
<property><name>fs.defaultFS</name><value>hdfs://namenode-0:9000</value></property>
</configuration>bash-4.3$
Simply type exit to return from the container shell.
Scale out the Datanode StatefulSet
We are going to start with the current configuration of 1 Datanode replica and 1 Namenode replica, and we will scale the Datanode StatefulSet to 3 replicas. This should create additional Pods as well as additional persistent volumes and persistent volume claims. Let’s see.
root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   1         1         21m
namenode   1         1         21m

root@srvr:~/cormac-hadoop# kubectl scale --replicas=3 statefulsets/datanode
statefulset.apps/datanode scaled

root@srvr:~/cormac-hadoop# kubectl get pvc
NAME                 STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dn-data-datanode-0   Bound     pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     16m
dn-data-datanode-1   Pending                                                                        hdfs-sc-dn     7s
nn-data-namenode-0   Bound     pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            hdfs-sc-nn     16m

root@srvr:~/cormac-hadoop# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS   REASON   AGE
pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-0   hdfs-sc-dn              16m
pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            Delete           Bound    default/nn-data-namenode-0   hdfs-sc-nn              16m
pvc-9cd769f7-f890-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-1   hdfs-sc-dn              6s

root@srvr:~/cormac-hadoop# kubectl get pods
NAME         READY   STATUS              RESTARTS   AGE
datanode-0   1/1     Running             0          16m
datanode-1   0/1     ContainerCreating   0          24s
namenode-0   1/1     Running             0          16m

root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   3         2         22m
namenode   1         1         22m
So changes are occurring, but they will take a little time. Let’s retry those commands now that some time has passed.
root@srvr:~/cormac-hadoop# kubectl get statefulsets
NAME       DESIRED   CURRENT   AGE
datanode   3         3         23m
namenode   1         1         23m

root@srvr:~/cormac-hadoop# kubectl get pvc
NAME                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dn-data-datanode-0   Bound    pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     18m
dn-data-datanode-1   Bound    pvc-9cd769f7-f890-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     2m12s
dn-data-datanode-2   Bound    pvc-b7ce55f9-f890-11e8-9e4a-005056970672   200Gi      RWO            hdfs-sc-dn     87s
nn-data-namenode-0   Bound    pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            hdfs-sc-nn     18m

root@srvr:~/cormac-hadoop# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS   REASON   AGE
pvc-5c5f5c5a-f88e-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-0   hdfs-sc-dn              18m
pvc-5c68ed9b-f88e-11e8-9e4a-005056970672   50Gi       RWO            Delete           Bound    default/nn-data-namenode-0   hdfs-sc-nn              18m
pvc-9cd769f7-f890-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-1   hdfs-sc-dn              2m3s
pvc-b7ce55f9-f890-11e8-9e4a-005056970672   200Gi      RWO            Delete           Bound    default/dn-data-datanode-2   hdfs-sc-dn              86s

root@srvr:~/cormac-hadoop# kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
datanode-0   1/1     Running   0          18m
datanode-1   1/1     Running   0          2m19s
datanode-2   1/1     Running   0          94s
namenode-0   1/1     Running   0          18m
root@srvr:~/cormac-hadoop#
Now we can see how the Datanode has scaled out with additional Pods and storage.
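If you want to confirm that each new persistent volume picked up the policy requested by its storage class, you can describe one of the PVs; for a volume provisioned by the vSphere Cloud Provider, the volume source should show the storagePolicyName (silver for the Datanode volumes in this case). The output is environment specific, so it is not reproduced here:

kubectl describe pv pvc-b7ce55f9-f890-11e8-9e4a-005056970672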
Post scale-out application check
And now for the final step – let’s check whether HDFS has indeed scaled out with those new Pods and PVs. We will run the same command as before and get an updated report from the application. Note the updated capacity figure and the additional Datanodes.
root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfsadmin -report
Configured Capacity: 630913425408 (587.58 GB)
Present Capacity: 630674206720 (587.36 GB)
DFS Remaining: 630674128896 (587.36 GB)
DFS Used: 77824 (76 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (3):

Name: 172.16.86.5:9866 (172.16.86.5)
Hostname: datanode-2.datanode.default.svc.cluster.local
Decommission Status : Normal
Configured Capacity: 210304475136 (195.86 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 62963712 (60.05 MB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.96%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Dec 05 13:24:37 GMT 2018
Last Block Report: Wed Dec 05 13:22:34 GMT 2018

Name: 172.16.88.2:9866 (172.16.88.2)
Hostname: datanode-1.datanode.default.svc.cluster.local
Decommission Status : Normal
Configured Capacity: 210304475136 (195.86 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 62963712 (60.05 MB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.96%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Dec 05 13:24:39 GMT 2018
Last Block Report: Wed Dec 05 13:21:57 GMT 2018

Name: 172.16.96.5:9866 (172.16.96.5)
Hostname: datanode-0.datanode.default.svc.cluster.local
Decommission Status : Normal
Configured Capacity: 210304475136 (195.86 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 62959616 (60.04 MB)
DFS Remaining: 210224709632 (195.79 GB)
DFS Used%: 0.00%
DFS Remaining%: 99.96%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Dec 05 13:24:38 GMT 2018
Last Block Report: Wed Dec 05 13:05:56 GMT 2018
Checking the Replication Factor of HDFS
The last thing we want to check is that the Datanodes are indeed replicating. There are a few ways to do this. The following commands create a simple file and then validate its replication factor. In both cases, the commands return 3, which is the default replication factor for HDFS.
root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfs -touchz /out.txt
root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfs -stat %r /out.txt
3
root@srvr:~/cormac-hadoop# kubectl exec -n default -it namenode-0 \
-- /opt/hadoop/bin/hdfs dfs -ls /out.txt
-rw-r--r--   3 hdfs admin          0 2018-12-05 15:55 /out.txt
Conclusion
That looks to have scaled out just fine. Now there are a few things to keep in mind when dealing with StatefulSets, as per the guidance found here: deleting and/or scaling down a StatefulSet will not delete the volumes associated with it. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
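In practice that means that after a scale-down or a delete, the claims (and the vSAN-backed volumes behind them) remain, and it is up to you to remove them once you are sure the data is no longer needed, for example:

kubectl delete statefulset datanode
kubectl get pvc          # the dn-data-* claims are still listed
kubectl delete pvc dn-data-datanode-0 dn-data-datanode-1 dn-data-datanode-2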
With that in mind, I hope this has added some clarity to how we can provide different vSAN policies to different parts of a cloud native application, providing additional protection when the application needs it, but not consuming any additional HCI (hyperconverged infrastructure) resources when the application is able to protect itself through built-in mechanisms.