# hudi-k3s
Messing around with Apache Hudi on a local k3s cluster. I wanted to see if I could get a basic data lakehouse setup running on my own home server using only open-source software, without losing my mind. Mostly worked.
## What's in here
- Spark on Kubernetes via the Spark Operator — runs a small PySpark job that writes and reads a Hudi table
- MinIO as a local S3-compatible object store (because I'm not paying for cloud storage to play around); the config sketch after this list shows how the Spark job talks to it
- ArgoCD for GitOps-style deployments — both MinIO and the Spark Operator are managed as ArgoCD apps
- Cilium network policies with a default-deny setup, because why not practice zero-trust even on localhost
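Wiring Spark to MinIO is just a handful of `fs.s3a.*` settings on the SparkSession. Here's a minimal sketch; the endpoint, credentials, and SSL setting are assumptions for a default in-cluster MinIO (the real values come from the Secret in the `hudi` namespace), but the config keys themselves are the standard S3A/Hudi ones:

```python
from pyspark.sql import SparkSession

# Sketch of the session config needed to point Spark (and Hudi) at MinIO.
# Endpoint and credentials are placeholders; in the cluster they come from
# the MinIO Secret rather than hardcoded strings.
spark = (
    SparkSession.builder.appName("hudi-demo")
    # Hudi requires Kryo serialization
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # In-cluster MinIO service endpoint (placeholder name/port)
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.minio.svc.cluster.local:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")  # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")  # placeholder
    # MinIO serves buckets under the path, not as subdomains
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)
```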
## The demo
The Spark job creates a tiny synthetic dataset of trip records, writes them to a Hudi copy-on-write (COW) table partitioned by city, then reads them back. Nothing fancy, just enough to prove the pipeline works end-to-end.
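The round trip looks roughly like this. A sketch, not the exact script: the schema, record values, and bucket/table names are made up, but the `hoodie.*` options are the standard ones for an upserted COW table partitioned by a field:

```python
from pyspark.sql import Row

# Tiny synthetic trip dataset (schema and values are illustrative)
trips = spark.createDataFrame([
    Row(trip_id="t1", city="berlin", fare=12.5, ts=1700000000),
    Row(trip_id="t2", city="berlin", fare=8.0,  ts=1700000100),
    Row(trip_id="t3", city="tokyo",  fare=21.3, ts=1700000200),
])

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",   # unique key per record
    "hoodie.datasource.write.partitionpath.field": "city",  # one partition per city
    "hoodie.datasource.write.precombine.field": "ts",       # newest ts wins on upsert
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
}

base_path = "s3a://hudi/trips"  # bucket name is a placeholder

# Write the table to MinIO, then read it back to prove the round trip
trips.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)
spark.read.format("hudi").load(base_path).show()
```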
## Rough architecture
```
ArgoCD
├── MinIO (Helm)         -> S3-compatible storage in the minio namespace
└── Spark Operator       -> manages SparkApplication CRDs

hudi namespace
├── SparkApplication     -> PySpark job (1 driver + 2 executors)
├── ConfigMap            -> the demo Python script
├── Secret               -> MinIO credentials
└── Cilium policies      -> locked-down networking
```
## Prerequisites
- A running k3s (or k8s) cluster
- ArgoCD installed
- Cilium as the CNI (for the network policies)
- `kubectl` configured for the cluster
## Usage
Apply the ArgoCD apps and let them sync:
```
kubectl apply -f manifests/argocd-apps/
```
Then deploy the Hudi namespace and resources:
```
kubectl apply -f manifests/hudi/
kubectl apply -f manifests/cilium-policies/
```
The SparkApplication starts on its own once applied; check its status and the driver logs:
```
kubectl get sparkapplication -n hudi
kubectl logs -n hudi hudi-demo-driver
```
## Notes
- This is just a playground — don't use the default MinIO credentials anywhere real
- The Spark job pulls the Hudi and Hadoop AWS jars from Maven at runtime, so the first run needs internet access (see the dependency sketch below)
- Tested on a single-node k3s setup on my desktop; your mileage may vary
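For reference, that runtime dependency pull is just `spark.jars.packages`. The coordinates below are illustrative; match the Hudi bundle to your Spark/Scala build and the hadoop-aws version to the Hadoop your Spark ships with:

```python
from pyspark.sql import SparkSession

# Illustrative Maven coordinates: e.g. hudi-spark3.4-bundle_2.12 for
# Spark 3.4 on Scala 2.12; pick hadoop-aws to match Spark's Hadoop.
packages = ",".join([
    "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1",
    "org.apache.hadoop:hadoop-aws:3.3.4",
])

spark = (
    SparkSession.builder.appName("hudi-demo")
    .config("spark.jars.packages", packages)  # fetched from Maven Central on first run
    .getOrCreate()
)
```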