
hudi-k3s

Messing around with Apache Hudi on a local k3s cluster. I wanted to see if I could get a basic data lakehouse setup running on my own home server using only open-source software, without losing my mind. Mostly worked.

What's in here

  • Spark on Kubernetes via the Spark Operator — runs a small PySpark job that writes and reads a Hudi table
  • MinIO as a local S3-compatible object store (because I'm not paying for cloud storage to play around) — the Spark job reaches it over S3A; see the config sketch after this list
  • ArgoCD for GitOps-style deployments — both MinIO and the Spark Operator are managed as ArgoCD apps
  • Cilium network policies with a default-deny setup, because why not practice zero-trust even on localhost
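
For reference, the sketch below shows roughly how a PySpark job can be pointed at an in-cluster MinIO service over S3A. The service endpoint, bucket and credential values are illustrative placeholders, not the ones used in the manifests.

from pyspark.sql import SparkSession

# Minimal sketch: point Spark's S3A client at MinIO instead of real S3.
# Endpoint and credentials are placeholders; the real values come from the
# MinIO deployment and the Secret in the hudi namespace.
spark = (
    SparkSession.builder
    .appName("hudi-demo")
    # MinIO speaks the S3 API but lives at a cluster-internal address,
    # so override the endpoint and force path-style access.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.minio.svc.cluster.local:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .config("spark.hadoop.fs.s3a.access.key", "CHANGE_ME")
    .config("spark.hadoop.fs.s3a.secret.key", "CHANGE_ME")
    .getOrCreate()
)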

The demo

The Spark job creates a tiny synthetic dataset of trip records, writes them to a Hudi copy-on-write (COW) table partitioned by city, then reads them back. Nothing fancy — just enough to prove the pipeline works end-to-end.
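
For context, the core of that job probably looks something like this sketch (reusing the spark session from the snippet above). The table name, bucket path and columns are made up for illustration, not copied from the repo.

from pyspark.sql import Row

# Tiny synthetic trip dataset, partitioned by city when written to Hudi.
trips = spark.createDataFrame([
    Row(trip_id="t1", city="berlin",  fare=12.50, ts=1700000000),
    Row(trip_id="t2", city="hamburg", fare=23.00, ts=1700000100),
    Row(trip_id="t3", city="berlin",  fare=8.75,  ts=1700000200),
])

# Standard Hudi write options for a copy-on-write table.
hudi_options = {
    "hoodie.table.name": "trips_cow",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

base_path = "s3a://hudi-demo/trips_cow"  # placeholder bucket/prefix
trips.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)

# Read the table back to prove the round trip works end-to-end.
spark.read.format("hudi").load(base_path).show()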

Rough architecture

ArgoCD
 ├── MinIO (Helm)        -> S3-compatible storage in the minio namespace
 └── Spark Operator      -> manages SparkApplication CRDs

hudi namespace
 ├── SparkApplication    -> PySpark job (1 driver + 2 executors)
 ├── ConfigMap           -> the demo Python script
 ├── Secret              -> MinIO credentials
 └── Cilium policies     -> locked-down networking

Prerequisites

  • A running k3s (or k8s) cluster
  • ArgoCD installed
  • Cilium as the CNI (for the network policies)
  • kubectl configured

Usage

Apply the ArgoCD apps and let them sync:

kubectl apply -f manifests/argocd-apps/

Then deploy the Hudi namespace and resources:

kubectl apply -f manifests/hudi/
kubectl apply -f manifests/cilium-policies/

The previous apply already kicks off the Spark job; check its status and the driver logs:

kubectl get sparkapplication -n hudi
kubectl logs -n hudi hudi-demo-driver

Notes

  • This is just a playground — don't use the default MinIO credentials anywhere real
  • The Spark job pulls Hudi + Hadoop AWS jars from Maven at runtime, so the first run needs internet access (see the packages sketch after these notes)
  • Tested on a single-node k3s setup on my desktop; your mileage may vary
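
As a rough illustration of that runtime dependency pull, the Spark session can list Maven coordinates via spark.jars.packages. The versions below are examples only, not necessarily the ones pinned in the manifests.

from pyspark.sql import SparkSession

# Illustrative only: fetch the Hudi Spark bundle and hadoop-aws (S3A support)
# from Maven Central at startup. Adjust the coordinates to your Spark version.
spark = (
    SparkSession.builder
    .appName("hudi-demo")
    .config(
        "spark.jars.packages",
        "org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0,"
        "org.apache.hadoop:hadoop-aws:3.3.4",
    )
    .getOrCreate()
)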