IaC for acend Kubernetes resources
This repo creates the basic acend infrastructure using Terraform and ArgoCD.
We use Hetzner as our cloud provider and RKE2 to create the Kubernetes cluster. The Kubernetes Cloud Controller Manager for Hetzner Cloud provisions load balancers from Kubernetes Service objects (type `LoadBalancer`) and also configures the networking & native routing for the Kubernetes cluster network traffic.
ArgoCD is used to deploy resources on the Kubernetes cluster.
Cluster Autoscaler is used to scale the Kubernetes Cluster beyond the initial minimal cluster size deployed by Terraform.
The minimal cluster size is set to 3 control plane nodes and 2 worker nodes.
Folder structure:
- `deploy`: Resources for ArgoCD application deployment
- `terraform`: All Terraform files for infrastructure deployment
In order to deploy our acend Kubernetes cluster, the following steps are necessary:
- Terraform to deploy the base infrastructure
  - VMs for control plane and worker nodes
  - Network
  - Load balancer for the Kubernetes API and RKE2
  - Firewall
  - Hetzner Cloud Controller Manager for the Kubernetes cluster networking
- Terraform to deploy and bootstrap ArgoCD
- ArgoCD to deploy resources on the Kubernetes cluster
- Cluster Autoscaler to scale the cluster beyond the minimal cluster size created with Terraform
```mermaid
flowchart LR
    A[Git Repository]
    A --> B{Terraform Cloud}
    B --> C{Hetzner Cloud}
    C -- deploy --> C1{Loadbalancer}
    C1 -- with service --> C11{K8s API 6443}
    C1 -- with service --> C12{RKE2 API 9345}
    C -- deploy --> C2{Control Plane VMs}
    C -- deploy --> C3{Worker VMs}
    C -- deploy --> C4{Private Network}
    C4 --> C41{Subnet for Nodes}
    C -- deploy --> C5{Firewall}
    C2 -- configure --> cloudinit
    C3 -- configure --> cloudinit
    B -- initial bootstrap --> D
    A --> D{ArgoCD + Bootstrap Application}
    D -- install --> D1{Applications}
```
We use Ubuntu 22.04 as our node operating system. Unattended-upgrades is enabled for automated security patching. If necessary, kured manages node reboots between 21:00 and 23:59:59.
Unattended-upgrades is configured to only run on Saturday & Sunday. This is done by overriding the `apt-daily-upgrade` timer. The override is placed in `/etc/systemd/system/apt-daily-upgrade.timer.d/override.conf` and created using cloud-init during deployment:

```ini
[Timer]
OnCalendar=
OnCalendar=Sat,Sun *-*-* 02:00:00
```
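Such an override can be created with a cloud-init `write_files` entry along these lines (a sketch — the actual cloud-init configuration in this repo is rendered by Terraform and may differ):

```yaml
#cloud-config
write_files:
  # Restrict unattended upgrades to the weekend by overriding
  # the apt-daily-upgrade systemd timer.
  - path: /etc/systemd/system/apt-daily-upgrade.timer.d/override.conf
    owner: root:root
    permissions: "0644"
    content: |
      [Timer]
      OnCalendar=
      OnCalendar=Sat,Sun *-*-* 02:00:00
```

The empty `OnCalendar=` line is required: it clears the schedule inherited from the stock timer unit before the weekend schedule is added.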
An RKE2 cluster has two types of nodes: a server node running the Kubernetes control plane, and an agent node running only the kubelet.
Our setup is based on the High Availability install instructions:
- RKE2 config files are initially generated with Terraform and placed in `/etc/rancher/rke2/config.yaml` with cloud-init.
- The token is generated with Terraform (`resource "random_password" "rke2_cluster_secret"`).
- Cilium is used as the CNI plugin and configured with the `HelmChartConfig` in `/var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml`.
- The Kubernetes cluster is kube-proxy free; the functionality is replaced with Cilium. See Kubernetes Without kube-proxy.
- Native routing is used instead of a tunneling mechanism (e.g. VXLAN). The Kubernetes Cloud Controller Manager for Hetzner Cloud is used to manage and provision the network setup (subnet & routing) for the cluster.
- Control plane nodes are tainted with `node-role.kubernetes.io/control-plane:true:NoSchedule`. Some of the (critical, infrastructure-related) applications are scheduled on control plane nodes.
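As a rough sketch, the Cilium `HelmChartConfig` mentioned above could look as follows. The concrete `valuesContent` in `rke2-cilium-config.yaml` may differ; the kube-proxy-replacement and native-routing settings shown here are assumptions based on the description above, not a copy of the repo's file:

```yaml
# /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml (sketch)
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    # Replace kube-proxy with Cilium's eBPF implementation
    kubeProxyReplacement: strict
    # Native routing instead of a tunnel (e.g. vxlan); the routes are
    # managed by the Hetzner cloud controller manager
    tunnel: disabled
```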
See Anatomy of a Next Generation Kubernetes Distribution for more details.
- Provision the load balancer for the Kubernetes API and the RKE2 supervisor.
- Provision the first control plane node.
- The RKE2 supervisor listens on port 9345/tcp for the other nodes to join the cluster.
- Control plane nodes 2 & 3 join the cluster using the same token; they have `server: https://${lb_address}:9345` set in the config file to join the existing cluster.
- Provision and join the agent nodes using the same token. They also have `server: https://${lb_address}:9345` set to join the existing cluster.
- Scale the cluster when needed using the cluster autoscaler.
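A joining node's `/etc/rancher/rke2/config.yaml` therefore looks roughly like this (values are placeholders; the real file is rendered by Terraform and cloud-init):

```yaml
# /etc/rancher/rke2/config.yaml on control plane nodes 2 & 3 and on agents
# Join via the load balancer in front of the RKE2 supervisor (port 9345)
server: https://<lb_address>:9345
# Shared cluster secret generated by Terraform
# (resource "random_password" "rke2_cluster_secret")
token: <rke2_cluster_secret>
```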
Check Install Terraform for more details on how to install and use the CLI.
Terraform Cloud is used for execution of Terraform runs and remote state storage. All secrets required to bootstrap the infrastructure are also stored in Terraform Cloud.
The following Terraform variables are important:
Root:
- `clustername`: The name of the Kubernetes cluster. This is used as a label on the cloud resources for better identification.
- `controlplane_count`: The number of control plane nodes Terraform deploys. This should always be set to `3`.
- `worker_count`: The number of worker nodes Terraform deploys. This should be set to a minimum of `2`.
- `k8s_api_hostnames`: A list of hostnames to be added to the Kubernetes API certificate.
- `extra_ssh_keys`: A list of extra SSH keys (besides the one generated in Terraform) to be deployed on the cluster nodes.
- `hcloud_api_token`: Hetzner API token.
- `hosttech_dns_token`: Hosttech API token for the DNS API.
- `hosttech-dns-zone-id`: The Hosttech zone ID in which the DNS entry for the K8s API LB is created.
- `provider-*`: Initially the kubeconfig file is retrieved from the first control plane node and then used to deploy onto the cluster. You can use `provider-client-certificate`, `provider-cluster_ca_certificate`, `provider-client-key`, and `provider-k8s-api-host` instead. Don't forget to change the `kubernetes` and `helm` providers in `terraform/modules/rke2-cluster/main.tf` if you want to.
- `first_install`: Set this to `true` if it is the very first installation. RKE2 requires the very first control plane node to be handled specially, and the DNS records for the ingress controller load balancer are only available after ArgoCD has installed the ingress controller. Defaults to `false`.
- `github-app-argocd-clientSecret`: Client secret for the GitHub OAuth app used in ArgoCD for authentication.
modules/rke2-cluster (currently not set via root; you can change the defaults in `modules/rke2-cluster/variables.tf`):
- `location`: The Hetzner location where cloud resources are deployed. Defaults to `nbg1`.
- `rke2_version`: The RKE2 version for initial node bootstrapping.
- `networkzone`: The Hetzner network zone for the private network. Defaults to `eu-central`.
- `lb_type`: Load balancer type for the K8s API and RKE2 API. Defaults to `lb11`.
- `node_image_type`: The image type of all deployed VMs. Defaults to `ubuntu-22.04`.
- `controlplane_type`: The node type for the control plane nodes. Defaults to `cpx31`.
- `worker_type`: The node type for the worker nodes. Defaults to `cpx41`.
- `cluster-domain`: The domain used in Ingress resources, e.g. for ArgoCD.
Terraform deploys an ArgoCD `Application` resource pointing to this repository, which deploys all resources from `deploy/bootstrap`. The `deploy/bootstrap` folder contains more ArgoCD `Application` resources to deploy all our applications. An application can be deployed using plain Kubernetes resource files, from Kustomize, or from Helm charts. See the ArgoCD documentation for details.
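A minimal sketch of such a bootstrap `Application` (the name, repo URL, and sync policy shown are illustrative, not the exact values Terraform deploys):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bootstrap            # illustrative name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acend/infrastructure.git  # this repo (illustrative URL)
    targetRevision: main
    path: deploy/bootstrap   # folder containing the child Application resources
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```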
Design decisions:
- We follow the App of Apps pattern.
- We use Kustomize applications. Each application folder in `deploy` contains a `kustomization.yaml` defining all the resources that shall be deployed.
- Each application folder contains a `base` folder. To structure multiple parts of an application, subfolders can be used.
- Each application folder can include an `overlay` folder if needed (e.g. if this repo is deployed into multiple environments).
- For Helm charts we also use Kustomize to generate YAML resources out of a Helm chart.
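For illustration, a `kustomization.yaml` that combines plain resources with a Helm chart could look like this. Chart name, repo, and file names are placeholders; kustomize's built-in `helmCharts` generator must be enabled (e.g. `kustomize build --enable-helm`):

```yaml
# deploy/<application>/base/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - ingress.yaml
helmCharts:
  - name: example-chart               # placeholder chart name
    repo: https://example.github.io/charts
    version: 1.2.3
    releaseName: example
    namespace: example
    valuesFile: values.yaml           # chart values kept next to the kustomization
```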
For the moment, no external authentication provider is included (see #11). We rely on ServiceAccounts and ServiceAccount JWT tokens to authenticate. RKE2 provides a set of admin credentials on initial installation. All other ServiceAccounts and JWT tokens are created manually or using the rbac-manager.
See Create a new ServiceAccount with a JWT Token and `cluster-admin` privileges for how to create new cluster access with `cluster-admin` privileges.
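Since Kubernetes 1.24, token Secrets are no longer created automatically for ServiceAccounts; a long-lived token can be requested by creating a Secret like the following (names are illustrative):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ci-bot-token            # illustrative name
  namespace: rbac-manager
  annotations:
    # ServiceAccount the token is issued for
    kubernetes.io/service-account.name: ci-bot
type: kubernetes.io/service-account-token
```

The token controller then populates the Secret's `data.token` field with a JWT for that ServiceAccount.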
There are two ServiceAccounts for automated deployment using a CI/CD system (e.g. GitHub Actions):
- `ci-bot` in namespace `rbac-manager`
- `ci-bot-test` in namespace `rbac-manager`

The `ci-bot*`s have a RoleBinding to the `edit` ClusterRole in all namespaces where:
- for `ci-bot`, the labels `ci-bot: true` and `env: prod` are set
- for `ci-bot-test`, the labels `ci-bot: true` and `env: test` are set
There are two Kyverno `ClusterPolicy`s named `add-ci-bot-label-to-acend-prod-ns` & `add-ci-bot-label-to-acend-test-ns` which automatically add the `ci-bot: true` and the correct `env` label to all namespaces whose name matches `acend-*-prod` or `acend-*-test`. But normally, namespaces are deployed using ArgoCD, therefore the labels should be set there.
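A namespace managed via ArgoCD would then carry the labels directly (namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: acend-website-prod    # illustrative name
  labels:
    ci-bot: "true"            # grants the ci-bot ServiceAccount edit access via rbac-manager
    env: prod
```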
In our GitHub organization, a kubeconfig file for the ServiceAccount
- `ci-bot` is stored as a secret with the name `KUBECONFIG_K8S_ACEND`
- `ci-bot-test` is stored as a secret with the name `KUBECONFIG_K8S_ACEND_TEST`
From Certificate Rotation in RKE2.
> By default, certificates in RKE2 expire in 12 months. If the certificates are expired or have fewer than 90 days remaining before they expire, the certificates are rotated when RKE2 is restarted.

This results in new ServiceAccount tokens, which then have to be updated everywhere they are used.
The Hetzner Cloud Console can be accessed via Hetzner Cloud Console. All provisioned resources are assigned to projects. We have the following projects:
Access and API tokens are assigned per project.
To get access, ask an existing project member to create a new invitation.
See Applications
See How to
See Troubleshooting