
# acend infrastructure

IaC for acend Kubernetes resources

This repo creates the basic acend infrastructure using Terraform and ArgoCD.

We use Hetzner as our cloud provider and RKE2 to create the Kubernetes cluster. The Kubernetes Cloud Controller Manager for Hetzner Cloud provisions load balancers from Kubernetes Service objects (type LoadBalancer) and also configures the networking and native routing for the Kubernetes cluster network traffic.
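
For illustration, a minimal Service the Cloud Controller Manager would turn into a Hetzner load balancer could look like the following sketch; the service name, selector and annotation values are placeholders, not taken from this repo:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx                              # hypothetical service name
  annotations:
    load-balancer.hetzner.cloud/location: nbg1     # Hetzner location
    load-balancer.hetzner.cloud/type: lb11         # load balancer size
spec:
  type: LoadBalancer                               # picked up by the hcloud CCM
  selector:
    app: ingress-nginx                             # hypothetical selector
  ports:
    - port: 443
      targetPort: 443
```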

ArgoCD is used to deploy resources on the Kubernetes cluster.

Cluster Autoscaler is used to scale the Kubernetes Cluster beyond the initial minimal cluster size deployed by Terraform.

The minimal cluster size is set to 3 control plane nodes and 2 worker nodes.
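
As a rough sketch of how the Cluster Autoscaler's Hetzner provider is typically wired up (not this repo's actual manifest; the pool name, sizes, image tag and secret name are assumptions):

```yaml
# Illustrative cluster-autoscaler Deployment for Hetzner Cloud.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.2
          command:
            - ./cluster-autoscaler
            - --cloud-provider=hetzner
            # node pool boundaries: min:max:instance-type:region:pool-name
            - --nodes=2:6:CPX41:NBG1:workers
          env:
            - name: HCLOUD_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hcloud    # assumed Secret holding the Hetzner API token
                  key: token
```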

Folder structure:

- deploy: Resources for ArgoCD application deployment
- terraform: All Terraform files for infrastructure deployment

## Cluster Creation Workflow

In order to deploy our acend Kubernetes cluster, the following steps are necessary:

1. Terraform to deploy the base infrastructure:
   - VMs for control plane and worker nodes
   - Network
   - Load balancer for the Kubernetes API and RKE2
   - Firewall
   - Hetzner Cloud Controller Manager for the Kubernetes cluster networking
2. Terraform to deploy and bootstrap ArgoCD
3. ArgoCD to deploy resources on the Kubernetes cluster
4. Cluster Autoscaler to scale the cluster beyond the minimal cluster size created with Terraform
```mermaid
flowchart LR
    A[Git Repository]
    A --> B{Terraform Cloud}

    B --> C{Hetzner Cloud}

    C -- deploy ---> C1{Loadbalancer}
    C1 -- with service ---> C11{K8s API 6443}
    C1 -- with service ---> C12{RKE2 API 9345}
    C -- deploy ---> C2{Control Plane VMs}
    C -- deploy ---> C3{Worker VMs}
    C -- deploy ---> C4{Private Network}
    C4 --> C41{Subnet for Nodes}
    C -- deploy ---> C5{Firewall}
    C2 -- configure ---> cloudinit
    C3 -- configure ---> cloudinit

    B -- initial bootstrap --> D

    A --> D{ArgoCD + Bootstrap Application}

    D -- install --> D1{Applications}
```


## Operating System

We use Ubuntu 22.04 as our node operating system. unattended-upgrades is enabled for automated security patching. If necessary, kured manages node reboots between 21:00 and 23:59:59.

Unattended upgrades are configured to run only on Saturday and Sunday. This is done by overriding the apt-daily-upgrade timer. The override lives in /etc/systemd/system/apt-daily-upgrade.timer.d/override.conf and is created with cloud-init during deployment:

```ini
[Timer]
OnCalendar=
OnCalendar=Sat,Sun *-*-* 02:00:00
```
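
If kured is installed via its Helm chart, the reboot window mentioned above could be expressed with values along these lines; a sketch only, where the key names follow the kured chart's `configuration` block and the time zone is an assumption:

```yaml
# Hedged sketch of kured Helm values restricting reboots to the window above.
configuration:
  startTime: "21:00"            # earliest reboot time
  endTime: "23:59:59"           # latest reboot time
  timeZone: "Europe/Zurich"     # assumption; not confirmed by this README
```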

## Cluster Design, Configuration and Setup Procedure

An RKE2 cluster has two types of nodes: server nodes, which run the Kubernetes control plane, and agent nodes, which run only the kubelet.

Our setup is based on the High Availability installation instructions:

- RKE2 config files are initially generated with Terraform and placed in /etc/rancher/rke2/config.yaml via cloud-init.
- The token is generated with Terraform (resource "random_password" "rke2_cluster_secret").
- Cilium is used as the CNI plugin and configured with a HelmChartConfig in /var/lib/rancher/rke2/server/manifests/rke2-cilium-config.yaml (see the sketches after this list).
- The Kubernetes cluster is kube-proxy free; the functionality is replaced by Cilium. See Kubernetes Without kube-proxy.
- Native routing is used instead of a tunneling mechanism (e.g. VXLAN). The Kubernetes Cloud Controller Manager for Hetzner Cloud is used to manage and provision the network setup (subnet & routing) for the cluster.
- Control plane nodes are tainted with node-role.kubernetes.io/control-plane:true:NoSchedule. Some applications (critical, infrastructure-related ones) are scheduled on control plane nodes.
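
Putting these points together, the rendered /etc/rancher/rke2/config.yaml on a control plane node could look roughly like the following sketch; every value is a placeholder, not this repo's real configuration:

```yaml
# Sketch of /etc/rancher/rke2/config.yaml on a control plane node.
token: <generated-by-random_password.rke2_cluster_secret>
cni: cilium
disable-kube-proxy: true                  # Cilium takes over kube-proxy duties
tls-san:
  - <k8s-api-hostname>                    # from var.k8s_api_hostnames
node-taint:
  - "node-role.kubernetes.io/control-plane=true:NoSchedule"
```

The Cilium HelmChartConfig for a kube-proxy-free, natively routed setup could be sketched as follows; Helm value names vary between Cilium versions, so treat this as an assumption rather than the exact file content:

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: true          # kube-proxy-free operation
    k8sServiceHost: <lb_address>        # API load balancer, placeholder
    k8sServicePort: 6443
    routingMode: native                 # native routing instead of e.g. vxlan
    ipv4NativeRoutingCIDR: <pod-cidr>   # placeholder for the pod network CIDR
```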

### tl;dr: Provision a Kubernetes Cluster with RKE2

See Anatomy of a Next Generation Kubernetes Distribution for more details

1. Provision the load balancer for the Kubernetes API and the RKE2 supervisor.
2. Provision the first control plane node.
3. The RKE2 supervisor listens on port 9345/tcp for the other nodes to join the cluster.
4. Control plane nodes 2 and 3 join the cluster using the same token; they have server: https://${lb_address}:9345 set in the config file to join the existing cluster.
5. Provision and join the agent nodes using the same token. They also have server: https://${lb_address}:9345 set to join the existing cluster (see the sketch below).
6. Scale the cluster when needed using the Cluster Autoscaler.
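
On a joining node (control plane or agent), the relevant part of /etc/rancher/rke2/config.yaml boils down to something like this sketch (placeholders, not actual values):

```yaml
# Sketch of /etc/rancher/rke2/config.yaml on a joining node.
server: https://<lb_address>:9345         # RKE2 supervisor behind the LB
token: <same-token-as-the-first-node>
```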

## Terraform Configuration

Check Install Terraform for details on how to install and use the CLI.

Terraform Cloud is used for execution of Terraform runs and remote state storage. All secrets required to bootstrap the infrastructure are also stored in Terraform Cloud.

### Important variables

The following Terraform variables are important:

Root:

- clustername: The name of the Kubernetes cluster. This is used as a label on the cloud resources for better identification.
- controlplane_count: The number of control plane nodes Terraform deploys. This should always be set to 3.
- worker_count: The number of worker nodes Terraform deploys. This should be set to a minimum of 2.
- k8s_api_hostnames: A list of hostnames to be added to the Kubernetes API certificate.
- extra_ssh_keys: A list of extra SSH keys (besides the one generated in Terraform) to be deployed on the cluster nodes.
- hcloud_api_token: Hetzner API token.
- hosttech_dns_token: Hosttech API token for the DNS API.
- hosttech-dns-zone-id: Hosttech zone ID in which the DNS entry for the Kubernetes API LB is created.
- provider-*: Initially, the kubeconfig file is retrieved from the first control plane node and then used to deploy onto the cluster. You can use provider-client-certificate, provider-cluster_ca_certificate, provider-client-key and provider-k8s-api-host instead. Don't forget to change the kubernetes and helm providers in terraform/modules/rke2-cluster/main.tf if you do.
- first_install: Set this to true for the very first installation. RKE2 requires the very first control plane node to be handled specially, and the DNS records for the ingress controller load balancer are only available after ArgoCD has installed the ingress controller. Defaults to false.
- github-app-argocd-clientSecret: Client secret for the GitHub OAuth app used by ArgoCD for authentication.

modules/rke2-cluster (currently not set via root; you can change the defaults in modules/rke2-cluster/variables.tf):

- location: The Hetzner location where cloud resources are deployed. Defaults to nbg1.
- rke2_version: The RKE2 version for the initial node bootstrapping.
- networkzone: The Hetzner network zone for the private network. Defaults to eu-central.
- lb_type: Load balancer type for the K8s API and RKE2 API. Defaults to lb11.
- node_image_type: The image type of all deployed VMs. Defaults to ubuntu-22.04.
- controlplane_type: The node type for the control plane nodes. Defaults to cpx31.
- worker_type: The node type for the worker nodes. Defaults to cpx41.
- cluster-domain: The domain used in Ingress resources, e.g. for ArgoCD.

## ArgoCD Bootstrap & Configuration

Terraform deploys an ArgoCD Application resource pointing to this repository, which deploys all resources from deploy/bootstrap. The deploy/bootstrap folder contains further ArgoCD Application resources to deploy all our applications. An application can be deployed using plain Kubernetes resource files, Kustomize or Helm charts. See the ArgoCD documentation for details.
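
Such a bootstrap Application could be sketched as follows; the repository URL, target revision and sync policy shown here are assumptions, not necessarily what Terraform actually renders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bootstrap
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acend/infrastructure   # assumed repository URL
    targetRevision: main                               # assumed branch
    path: deploy/bootstrap
  destination:
    server: https://kubernetes.default.svc             # the local cluster
  syncPolicy:
    automated:            # assumed; the real sync policy may differ
      prune: true
      selfHeal: true
```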

Design decisions:

- We follow the App of Apps pattern.
- We use Kustomize applications. Each application folder in deploy contains a kustomization.yaml defining all the resources that shall be deployed.
- Each application folder contains a base folder. To structure multiple parts of an application, subfolders can be used.
- Each application folder can include an overlay folder if needed (e.g. if this repo is deployed into multiple environments).
- For Helm charts we also use Kustomize to generate the YAML resources out of the chart (see the sketch below).
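
As an example of the last point, a kustomization.yaml using kustomize's helmCharts generator (which requires building with --enable-helm) could look like this; the chart, version and values are purely illustrative:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cert-manager            # hypothetical target namespace
helmCharts:
  - name: cert-manager             # hypothetical chart
    repo: https://charts.jetstack.io
    version: v1.13.0
    releaseName: cert-manager
    valuesInline:
      installCRDs: true
```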

## Cluster Access

For the moment, no external authentication provider is included (see #11). We rely on ServiceAccounts and ServiceAccount JWT tokens for authentication. RKE2 provides a set of admin credentials on initial installation. All other ServiceAccounts and JWT tokens are created manually or using the rbac-manager.

See Create a new ServiceAccount with a JWT Token and cluster-admin privileges for how to set up new cluster access with cluster-admin privileges.
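
In outline, such a manually created ServiceAccount with a long-lived token could look like the following sketch; the names are illustrative, and since Kubernetes 1.24 the long-lived JWT has to be requested explicitly via a Secret of type kubernetes.io/service-account-token:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-bot                 # hypothetical name
  namespace: kube-system
---
apiVersion: v1
kind: Secret
metadata:
  name: admin-bot-token
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: admin-bot
type: kubernetes.io/service-account-token   # asks the controller to mint a JWT
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-bot
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: admin-bot
    namespace: kube-system
```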

### ci-bot Access

There are two ServiceAccounts for automated deployment using a CI/CD system (e.g. GitHub Actions):

- ci-bot in Namespace rbac-manager
- ci-bot-test in Namespace rbac-manager

The ci-bot* ServiceAccounts have a RoleBinding to the edit ClusterRole in all Namespaces where:

- for ci-bot, the labels ci-bot: true and env: prod are set
- for ci-bot-test, the labels ci-bot: true and env: test are set (see the sketch below)
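
With the rbac-manager mentioned above, such a label-scoped binding could be expressed roughly like this sketch (the actual RBACDefinition in this repo may differ):

```yaml
apiVersion: rbacmanager.reactiveops.io/v1beta1
kind: RBACDefinition
metadata:
  name: ci-bot
rbacBindings:
  - name: ci-bot
    subjects:
      - kind: ServiceAccount
        name: ci-bot
        namespace: rbac-manager
    roleBindings:
      - clusterRole: edit
        namespaceSelector:        # bind in every Namespace with these labels
          matchLabels:
            ci-bot: "true"
            env: prod
```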

There are two Kyverno ClusterPolicies named add-ci-bot-label-to-acend-prod-ns & add-ci-bot-label-to-acend-test-ns which automatically add the ci-bot: true label and the correct env label to all Namespaces matching the acend-*-prod or acend-*-test naming pattern (sketched below). But normally, Namespaces are deployed using ArgoCD, therefore the labels should be set there.
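
One of the two policies could look along these lines; the match pattern and labels are inferred from the description above, and the real policy may differ in detail:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-ci-bot-label-to-acend-prod-ns
spec:
  rules:
    - name: add-ci-bot-labels
      match:
        any:
          - resources:
              kinds:
                - Namespace
              names:
                - "acend-*-prod"      # wildcard match on the Namespace name
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              ci-bot: "true"
              env: prod
```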

In our GitHub organization, a kubeconfig file for the ServiceAccount

- ci-bot is stored as a secret named KUBECONFIG_K8S_ACEND
- ci-bot-test is stored as a secret named KUBECONFIG_K8S_ACEND_TEST

## Certificate & Token Rotation

From Certificate Rotation in RKE2:

> By default, certificates in RKE2 expire in 12 months. If the certificates are expired or have fewer than 90 days remaining before they expire, the certificates are rotated when RKE2 is restarted.

This results in new ServiceAccount tokens, which then have to be updated everywhere they are used.

## Hetzner Cloud Console

The Hetzner Cloud Console can be accessed via Hetzner Cloud Console. All provisioned resources are assigned to projects. We have the following projects:

Access and API tokens are assigned per project.

To get access, ask an existing project member to create a new invitation.

## Applications

See Applications

## How to

See How to

## Troubleshooting

See Troubleshooting