So you have a cluster, and you have some persistent volumes on a few of the nodes. Now let's say you want to decommission one of those nodes, but it has volumes on it that your pods make use of. You'd think this would be simple enough with a popular platform like k8s, but you'd be wrong.
⚠️ Disclaimer: At the time I solved this problem I didn't do thorough research and didn't know about pv-migrate. That tool, which has rave reviews btw, is great, but automates a bit less than I needed for my use case. Still, rather use pv-migrate than roll your own like I did.
If you're using cloud-backed block storage like AWS EBS then this isn't such a big problem. Basically, with a few AWS commands you can detach a volume and re-attach it to another EC2 node. But if you don't trust AWS, or don't like what they stand for, or both, like myself, then you're running nodes on a cheaper cloud provider that doesn't grant such luxuries. In fact, in my case I had nodes running in both Germany and South Africa, communicating over the open internet (bad security posture, I know). I wanted to decom my SA nodes and move their volumes to my German nodes.
How do volumes work in k8s? Here's the summary:
Big oversimplification, but as long as you get the idea: Usually you create a PVC along with a Deployment (or whatever pod controller you like), and then indicate on the pod that it should mount a PVC (.spec.volumes.*.persistentVolumeClaim). Once the pod is created for the first time, whichever storage controller you have set up will provision a persistent volume on one of the nodes, and the pod will start running on that node and mount that volume. Again, this is only one way of doing persistent volumes on k8s, and it is a simplification.
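To make the wiring concrete, here's a minimal sketch of that setup. All names and the image are illustrative, not from my actual cluster:

```yaml
# A PVC, and a Deployment whose pod mounts it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data            # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app             # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      containers:
        - name: app
          image: nginx     # stand-in image
          volumeMounts:
            - name: data
              mountPath: /var/lib/app
      volumes:
        - name: data
          persistentVolumeClaim:     # <-- .spec.volumes.*.persistentVolumeClaim
            claimName: my-data
```

When the pod first schedules, the storage controller provisions a PV backing `my-data` on whichever node the pod lands on — which is exactly why the volume is now "stuck" to that node.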
What are we trying to do here? Ideally this:
- Duplicate a given volume from one node onto another.
- Minimal configuration changes. I want to scale down the deployment, do the volume copy, then scale the deployment back up and have it run on the new node with the newly-copied volume.
Challenge 1: Copy the volume contents from one node to another
My strategy (hilariously) assumes pod-to-pod network encryption, and basically is a dead-simple tar-copy.
The reader pod will naturally run on the node where the source volume is, but to ensure the destination volume is provisioned on the correct node, be sure to set the nodeSelector on the writer pod.
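As a sketch, the writer pod might look something like this — the pod name, image, and node label value are assumptions; the important bits are the `nodeSelector` and the claim on the destination PVC:

```yaml
# Writer pod: pinned to the destination node so the new PV is provisioned there.
apiVersion: v1
kind: Pod
metadata:
  name: pv-copy-writer               # hypothetical name
spec:
  nodeSelector:
    kubernetes.io/hostname: my-german-node   # destination node (illustrative)
  containers:
    - name: writer
      image: alpine                  # stand-in image; just needs tar
      command: ["sleep", "infinity"] # idles until we stream the tar archive in
      volumeMounts:
        - name: dest
          mountPath: /dest
  volumes:
    - name: dest
      persistentVolumeClaim:
        claimName: tmp-copy          # the destination PVC from Challenge 2
```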
This approach works quite well for up to 10GB or so. I imagine for larger volumes you'd need a more reliable file transfer solution that can handle retries and so on. I'll leave that as an exercise for the reader.
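Stripped of the Kubernetes plumbing, the copy itself really is just a tar pipe. Here's the same pattern demonstrated on local directories (in-cluster, the left side of the pipe runs in the reader pod and the right side in the writer pod):

```shell
set -euo pipefail

# Stand-ins for the source and destination volume mount points.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "hello" > "$src/file.txt"
mkdir -p "$src/nested"
echo "world" > "$src/nested/deep.txt"

# Stream the whole tree from src to dst, preserving structure and permissions.
tar -C "$src" -cf - . | tar -C "$dst" -xf -

# Verify the copy is byte-identical.
diff -r "$src" "$dst" && echo "copy ok"
```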
Challenge 2: Modifying a PVC in-place
Certain fields you actually can't modify in-place because of api-server-level restrictions (a bound PVC's spec.volumeName, for instance, is immutable). But that just means we have to get creative 🙂
So to start off with, let's assume our little tar-copy job above copied the data into a new PVC and PV on the destination node. I call the destination PVC tmp-copy, as we'll delete it later and replace it with the source PVC's config.
1. Save the full source PVC config (we'll recreate it later)

```shell
kubectl get pvc "$SRC_PVC_NAME" -o yaml > /tmp/src_pvc.yaml
```

2. "Detach" the source pv from its pvc, then delete the pvc.

```shell
kubectl patch pv $SRC_PV -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
kubectl patch pvc $SRC_PVC_NAME -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc $SRC_PVC_NAME
```

3. Do the same with the destination pv and its pvc

```shell
kubectl patch pv $DEST_PV -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
kubectl patch pvc tmp-copy -p '{"metadata":{"finalizers":null}}'
kubectl delete pvc tmp-copy
```

4. Now that we only have two loose PVs (and, crucially, the yaml config of the original PVC) we just need to re-create the PVC, attaching it onto the destination PV:

```shell
# Clear the stale claimRef left over from tmp-copy, so the Released PV can bind again
kubectl patch pv $DEST_PV --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
cat /tmp/src_pvc.yaml | yq -Y ".spec.volumeName = \"$DEST_PV\"" | kubectl apply -f -
# Update PVC node annotation
kubectl annotate --overwrite pvc "$SRC_PVC_NAME" "volume.kubernetes.io/selected-node=$DEST_NODE"
```

5. Cleanup

```shell
# Delete last-applied-configuration so that we don't confuse argo
kubectl annotate --overwrite pvc "$SRC_PVC_NAME" kubectl.kubernetes.io/last-applied-configuration-
# Optional: yeet the original pv
kubectl delete pv $SRC_PV
```

And that's it.
Source please
Here you go. I split the code into 3 parts:
Part 0: Create an orchestrator job with the needed permissions to manage this whole operation. I intentionally wanted this to run fully in-cluster to make it a bit more reliable.
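For reference, running the orchestrator in-cluster means its ServiceAccount needs RBAC permissions roughly like the following. This is a sketch — the role name is an assumption, and note that PersistentVolumes are cluster-scoped, so a ClusterRole (not a namespaced Role) is required:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pv-copy-orchestrator        # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec"]        # needed to run the tar pipe via exec
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "create", "patch", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "patch", "delete"]
```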
Part 1: Create the reader pod
Part 2: Create and run the writer pod
After the file copy is complete the orchestrator cleans up the reader and writer pods before doing the PVC re-attachment magic.
And there you have it. A fully automated pv-copy strategy that took me weeks to figure out, nicely condensed for you into a 5-min read.
