# hudi-k3s
Messing around with Apache Hudi on a local k3s cluster. I wanted to see if I could get a basic data lakehouse setup running on my own home server using only open-source software, without losing my mind. Mostly worked.
## What's in here
- Spark on Kubernetes via the Spark Operator — runs a small PySpark job that writes and reads a Hudi table
- MinIO as a local S3-compatible object store (because I'm not paying for cloud storage to play around); the config sketch after this list shows how the Spark job talks to it
- ArgoCD for GitOps-style deployments — both MinIO and the Spark Operator are managed as ArgoCD apps
- Cilium network policies with a default-deny setup, because why not practice zero-trust even on localhost
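Wiring Spark to MinIO is just a handful of `fs.s3a.*` settings on the SparkSession. Here's a minimal sketch; the endpoint, credentials, and SSL setting are assumptions for a default in-cluster MinIO (the real values come from the Secret in the `hudi` namespace), but the config keys themselves are the standard S3A/Hudi ones:

```python
from pyspark.sql import SparkSession

# Sketch of the session config needed to point Spark (and Hudi) at MinIO.
# Endpoint and credentials are placeholders; in the cluster they come from
# the MinIO Secret rather than hardcoded strings.
spark = (
    SparkSession.builder.appName("hudi-demo")
    # Hudi requires Kryo serialization
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # In-cluster MinIO service endpoint (placeholder name/port)
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.minio.svc.cluster.local:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")  # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")  # placeholder
    # MinIO serves buckets under the path, not as subdomains
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)
```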
## The demo
The Spark job creates a tiny synthetic dataset of trip records, writes them to a Hudi copy-on-write (COW) table partitioned by city, then reads them back. Nothing fancy, just enough to prove the pipeline works end-to-end.
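The round trip looks roughly like this. A sketch, not the exact script: the schema, record values, and bucket/table names are made up, but the `hoodie.*` options are the standard ones for an upserted COW table partitioned by a field:

```python
from pyspark.sql import Row

# Tiny synthetic trip dataset (schema and values are illustrative)
trips = spark.createDataFrame([
    Row(trip_id="t1", city="berlin", fare=12.5, ts=1700000000),
    Row(trip_id="t2", city="berlin", fare=8.0,  ts=1700000100),
    Row(trip_id="t3", city="tokyo",  fare=21.3, ts=1700000200),
])

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",   # unique key per record
    "hoodie.datasource.write.partitionpath.field": "city",  # one partition per city
    "hoodie.datasource.write.precombine.field": "ts",       # newest ts wins on upsert
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
}

base_path = "s3a://hudi/trips"  # bucket name is a placeholder

# Write the table to MinIO, then read it back to prove the round trip
trips.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)
spark.read.format("hudi").load(base_path).show()
```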
## Rough architecture
```
ArgoCD
├── MinIO (Helm)         -> S3-compatible storage in the minio namespace
└── Spark Operator       -> manages SparkApplication CRDs

hudi namespace
├── SparkApplication     -> PySpark job (1 driver + 2 executors)
├── ConfigMap            -> the demo Python script
├── Secret               -> MinIO credentials
└── Cilium policies      -> locked-down networking
```
## Prerequisites
- A running k3s (or k8s) cluster
- ArgoCD installed
- Cilium as the CNI (for the network policies)
- `kubectl` configured for the cluster
## Usage
Apply the ArgoCD apps and let them sync:
```
kubectl apply -f manifests/argocd-apps/
```
Then deploy the Hudi namespace and resources:
```
kubectl apply -f manifests/hudi/
kubectl apply -f manifests/cilium-policies/
```
The SparkApplication starts on its own once applied; check its status and the driver logs:
```
kubectl get sparkapplication -n hudi
kubectl logs -n hudi hudi-demo-driver
```
## Notes
- This is just a playground — don't use the default MinIO credentials anywhere real
- The Spark job pulls the Hudi and Hadoop AWS jars from Maven at runtime, so the first run needs internet access (see the dependency sketch below)
- Tested on a single-node k3s setup on my desktop; your mileage may vary
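For reference, that runtime dependency pull is just `spark.jars.packages`. The coordinates below are illustrative; match the Hudi bundle to your Spark/Scala build and the hadoop-aws version to the Hadoop your Spark ships with:

```python
from pyspark.sql import SparkSession

# Illustrative Maven coordinates: e.g. hudi-spark3.4-bundle_2.12 for
# Spark 3.4 on Scala 2.12; pick hadoop-aws to match Spark's Hadoop.
packages = ",".join([
    "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1",
    "org.apache.hadoop:hadoop-aws:3.3.4",
])

spark = (
    SparkSession.builder.appName("hudi-demo")
    .config("spark.jars.packages", packages)  # fetched from Maven Central on first run
    .getOrCreate()
)
```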