From f8f89a70e51d1e575b2408545c87f95b33828e89 Mon Sep 17 00:00:00 2001
From: Tom Wieczorek
Date: Thu, 28 Nov 2024 16:37:33 +0100
Subject: [PATCH 1/2] Move troubleshooting docs into a subfolder

So that the file structure follows the document structure a bit.

Signed-off-by: Tom Wieczorek
---
 docs/runtime.md                               |  2 +-
 docs/{ => troubleshooting}/FAQ.md             |  0
 docs/{ => troubleshooting}/logs.md            |  0
 docs/{ => troubleshooting}/support-dump.md    |  2 +-
 docs/{ => troubleshooting}/troubleshooting.md | 12 ++++++++++--
 mkdocs.yml                                    |  8 ++++----
 6 files changed, 16 insertions(+), 8 deletions(-)
 rename docs/{ => troubleshooting}/FAQ.md (100%)
 rename docs/{ => troubleshooting}/logs.md (100%)
 rename docs/{ => troubleshooting}/support-dump.md (96%)
 rename docs/{ => troubleshooting}/troubleshooting.md (91%)

diff --git a/docs/runtime.md b/docs/runtime.md
index 7a19bafc4679..3df95e8b9b4f 100644
--- a/docs/runtime.md
+++ b/docs/runtime.md
@@ -266,7 +266,7 @@ metrics][cadvisor-metrics] when using cri-dockerd.
 [install cri-dockerd]: https://github.com/Mirantis/cri-dockerd#using-cri-dockerd
 [worker profiles]: worker-node-config.md#worker-profiles
 [dynamic configuration]: dynamic-configuration.md
-[cadvisor-metrics]: ./troubleshooting.md#using-a-custom-container-runtime-and-missing-labels-in-prometheus-metrics
+[cadvisor-metrics]: ./troubleshooting/troubleshooting.md#using-a-custom-container-runtime-and-missing-labels-in-prometheus-metrics
 
 #### Verification
 
diff --git a/docs/FAQ.md b/docs/troubleshooting/FAQ.md
similarity index 100%
rename from docs/FAQ.md
rename to docs/troubleshooting/FAQ.md
diff --git a/docs/logs.md b/docs/troubleshooting/logs.md
similarity index 100%
rename from docs/logs.md
rename to docs/troubleshooting/logs.md
diff --git a/docs/support-dump.md b/docs/troubleshooting/support-dump.md
similarity index 96%
rename from docs/support-dump.md
rename to docs/troubleshooting/support-dump.md
index 6908d7d8c139..bc2fe84cee76 100644
--- a/docs/support-dump.md
+++ b/docs/troubleshooting/support-dump.md
@@ -1,6 +1,6 @@
 # Support Insight
 
-In many cases, especially when looking for [commercial support](commercial-support.md) there's a need for share the cluster state with other people.
+In many cases, especially when looking for [commercial support](../commercial-support.md), there's a need to share the cluster state with other people.
 While one could always give access to the live cluster that is not always desired nor even possible.
 For those kind of cases we can lean on the work our friends at [troubleshoot.sh](https://troubleshoot.sh) have done.
 
diff --git a/docs/troubleshooting.md b/docs/troubleshooting/troubleshooting.md
similarity index 91%
rename from docs/troubleshooting.md
rename to docs/troubleshooting/troubleshooting.md
index 43e4360bef44..b1863eca0017 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting/troubleshooting.md
@@ -67,7 +67,7 @@ io.containerd.snapshotter.v1 zfs linux/amd64 ok
 ...
 ```
 
-- create a containerd config according to the [documentation](runtime.md): `$ containerd config default > /etc/k0s/containerd.toml`
+- create a containerd config according to the [documentation](../runtime.md): `$ containerd config default > /etc/k0s/containerd.toml`
 - modify the line in `/etc/k0s/containerd.toml`:
 
 ```toml
@@ -92,7 +92,15 @@ to
 
 ## Pods pending when using cloud providers
 
-Once we enable [cloud provider support](cloud-providers.md) on kubelet on worker nodes, kubelet will automatically add a taint `node.cloudprovider.kubernetes.io/uninitialized` for the node. This tain will prevent normal workloads to be scheduled on the node until the cloud provider controller actually runs second initialization on the node and removes the taint. This means that these nodes are not available for scheduling until the cloud provider controller is actually successfully running on the cluster.
+Once we enable [cloud provider support](../cloud-providers.md) on kubelet on worker nodes, kubelet will automatically add a taint `node.cloudprovider.kubernetes.io/uninitialized` for the node. This taint prevents normal workloads from being scheduled on the node until the cloud provider controller runs a second initialization on the node and removes the taint. This means that these nodes are not available for scheduling until the cloud provider controller is successfully running on the cluster.
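+
+You can check whether a node still carries this taint with, for example,
+kubectl (replace `<node-name>` with the name of an actual node):
+
+```sh
+# Prints the node's taints, if any
+kubectl get node <node-name> -o jsonpath='{.spec.taints}'
+```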
 
 For troubleshooting your specific cloud provider see its documentation.
 
diff --git a/mkdocs.yml b/mkdocs.yml
index 16dd9d0231a2..5af30767b0e5 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -68,10 +68,10 @@ nav:
     - GitOps with Flux: examples/gitops-flux.md
     - OpenEBS storage: examples/openebs.md
   - Troubleshooting:
-    - FAQ: FAQ.md
-    - Logs: logs.md
-    - Common Pitfalls: troubleshooting.md
-    - Support Insights: support-dump.md
+    - FAQ: troubleshooting/FAQ.md
+    - Logs: troubleshooting/logs.md
+    - Common Pitfalls: troubleshooting/troubleshooting.md
+    - Support Insights: troubleshooting/support-dump.md
   - Reference:
     - Architecture: architecture/index.md
     - Command Line: cli/README.md

From 892415543386f149f7ffca61a5b9b5ec77431369 Mon Sep 17 00:00:00 2001
From: Tom Wieczorek
Date: Fri, 29 Nov 2024 08:53:57 +0100
Subject: [PATCH 2/2] Add troubleshooting section on regenerating CAs

This describes the "offline" version of the process. While this could be
done in multiple passes, with less downtime but much more complexity and
work, let's start with the supposedly "simplest" alternative.

Signed-off-by: Tom Wieczorek
---
 docs/custom-ca.md                |   4 +
 docs/k0s-multi-node.md           |   4 +-
 docs/troubleshooting/FAQ.md      |   9 ++
 .../certificate-authorities.md   | 117 +++++++++++++++++++
 mkdocs.yml                       |   1 +
 5 files changed, 133 insertions(+), 2 deletions(-)
 create mode 100644 docs/troubleshooting/certificate-authorities.md

diff --git a/docs/custom-ca.md b/docs/custom-ca.md
index 6805a02a3387..d701cfccfc18 100644
--- a/docs/custom-ca.md
+++ b/docs/custom-ca.md
@@ -38,3 +38,7 @@ Here's an example of a command for pre-generating a token for a controller.
 ```shell
 k0s token pre-shared --role controller --cert /var/lib/k0s/pki/ca.crt --url https://<controller-ip>:9443/
 ```
+
+## See also
+
+- [Certificate Authorities](troubleshooting/certificate-authorities.md)
diff --git a/docs/k0s-multi-node.md b/docs/k0s-multi-node.md
index be253feba363..97b36a25f62d 100644
--- a/docs/k0s-multi-node.md
+++ b/docs/k0s-multi-node.md
@@ -64,7 +64,7 @@ To get a token, run the following command on one of the existing controller nodes:
 sudo k0s token create --role=worker
 ```
 
-The resulting output is a long [token](#about-tokens) string, which you can use to add a worker to the cluster.
+The resulting output is a long [token](#about-join-tokens) string, which you can use to add a worker to the cluster.
 
 For enhanced security, run the following command to set an expiration time for the token:
 
@@ -84,7 +84,7 @@ sudo k0s install worker --token-file /path/to/token/file
 sudo k0s start
 ```
 
-#### About tokens
+#### About join tokens
 
 The join tokens are base64-encoded [kubeconfigs](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/) for several reasons:
 
diff --git a/docs/troubleshooting/FAQ.md b/docs/troubleshooting/FAQ.md
index 49d6a9ab19ee..dadae6005ae1 100644
--- a/docs/troubleshooting/FAQ.md
+++ b/docs/troubleshooting/FAQ.md
@@ -31,3 +31,12 @@ As a default, the control plane does not run kubelet at all, and will not accept
 ## Is k0sproject really open source?
 
 Yes, k0sproject is 100% open source. The source code is under Apache 2 and the documentation is under the Creative Commons License. Mirantis, Inc. is the main contributor and sponsor for this OSS project: building all the binaries from upstream, performing necessary security scans and calculating checksums so that it's easy and safe to use. The use of these ready-made binaries are subject to Mirantis EULA and the binaries include only open source software.
+
+## A kubeconfig created via [`k0s kubeconfig`](../cli/k0s_kubeconfig.md) has been leaked. What can I do?
+
+Kubernetes does not support certificate revocation (see [k/k/18982]). This means
+that you cannot disable the leaked credentials. The only way to effectively
+revoke them is to [replace the Kubernetes CA] for your cluster.
+
+[k/k/18982]: https://github.com/kubernetes/kubernetes/issues/18982
+[replace the Kubernetes CA]: certificate-authorities.md#replacing-the-kubernetes-ca-and-sa-key-pair
diff --git a/docs/troubleshooting/certificate-authorities.md b/docs/troubleshooting/certificate-authorities.md
new file mode 100644
index 000000000000..ec8ec9c89434
--- /dev/null
+++ b/docs/troubleshooting/certificate-authorities.md
@@ -0,0 +1,117 @@
+# Certificate Authorities (CAs)
+
+## Overview of CAs managed by k0s
+
+k0s maintains two Certificate Authorities and one public/private key pair:
+
+* The **Kubernetes CA** is used to secure the Kubernetes cluster and manage
+  client and server certificates for API communication.
+* The **etcd CA** is used only when managed etcd is enabled, for securing etcd
+  communications.
+* The **Kubernetes Service Account (SA) key pair** is used for signing
+  Kubernetes [service account tokens].
+
+These CAs are automatically created during cluster initialization and have a
+default expiration period of 10 years. They are distributed once to all k0s
+controllers as part of k0s's [join process]. Replacing them is a manual process,
+as k0s currently lacks automation for CA renewal.
+
+[service account tokens]: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/
+[join process]: ../k0s-multi-node.md#5-add-controllers-to-the-cluster
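+
+To check when a cluster's current CA certificates will expire, you can
+inspect them with, for example, OpenSSL (this assumes the default data
+directory `/var/lib/k0s`):
+
+```sh
+# Show the subject and expiry date of the Kubernetes CA certificate
+openssl x509 -noout -subject -enddate -in /var/lib/k0s/pki/ca.crt
+```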
+
+## Replacing the Kubernetes CA and SA key pair
+
+The following steps describe how to manually replace the Kubernetes CA and SA
+key pair by taking the cluster down, regenerating both, redistributing them to
+all nodes, and then bringing the cluster back online:
+
+1. Take a [backup]! Things might go wrong at any level.
+
+2. Stop k0s on all worker and controller nodes. All the instructions below
+   assume that all k0s nodes are using the default data directory
+   `/var/lib/k0s`. Please adjust accordingly if you're using a different data
+   directory path.
+
+3. Delete the Kubernetes CA and SA key pair files from all the controller
+   data directories:
+
+   * `/var/lib/k0s/pki/ca.crt`
+   * `/var/lib/k0s/pki/ca.key`
+   * `/var/lib/k0s/pki/sa.pub`
+   * `/var/lib/k0s/pki/sa.key`
+
+   Delete the kubelet's kubeconfig file and the kubelet's PKI directory from all
+   worker data directories. Note that this includes controllers that have been
+   started with the `--enable-worker` flag:
+
+   * `/var/lib/k0s/kubelet.conf`
+   * `/var/lib/k0s/kubelet/pki`
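+
+   As an example, on a controller that was started with the `--enable-worker`
+   flag, the cleanup could look like this:
+
+   ```sh
+   # Remove the Kubernetes CA and the SA key pair (controller files)
+   rm /var/lib/k0s/pki/ca.crt /var/lib/k0s/pki/ca.key \
+      /var/lib/k0s/pki/sa.pub /var/lib/k0s/pki/sa.key
+   # Remove the kubelet's kubeconfig and PKI directory (worker files)
+   rm /var/lib/k0s/kubelet.conf
+   rm -r /var/lib/k0s/kubelet/pki
+   ```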
+
+4. Choose one controller as the "first" one. Restart k0s on the first
+   controller. If this controller is running with the `--enable-worker` flag,
+   you should **reboot the machine** instead. This ensures that all processes
+   and pods are cleanly restarted. After the restart, k0s will have
+   regenerated a new Kubernetes CA and SA key pair.
+
+5. Distribute the new CA and SA key pair to the other controllers by copying
+   the following files from the first controller to each of the remaining
+   ones:
+
+   * `/var/lib/k0s/pki/ca.crt`
+   * `/var/lib/k0s/pki/ca.key`
+   * `/var/lib/k0s/pki/sa.pub`
+   * `/var/lib/k0s/pki/sa.key`
+
+   After copying the files, the new CA and SA key pair are in place. Restart k0s
+   on the other controllers. For controllers running with the `--enable-worker`
+   flag, **reboot the machines** instead.
+
+6. Rejoin all workers. The easiest way to do this is to use a
+   `kubelet-bootstrap.conf` file. You can [generate](../cli/k0s_token_create.md)
+   such a file on a controller like this (see the section on [join tokens] for
+   details):
+
+   ```sh
+   touch /tmp/rejoin-token &&
+   chmod 0600 /tmp/rejoin-token &&
+   k0s token create --expiry 1h |
+   base64 -d |
+   gunzip >/tmp/rejoin-token
+   ```
+
+   Copy that file to each worker node and place it at
+   `/var/lib/k0s/kubelet-bootstrap.conf`. Then reboot the machine.
+
+7. When all workers are back online, the `kubelet-bootstrap.conf` files can be
+   safely removed from the workers. You can also invalidate the token so you
+   don't have to wait for it to expire: use [`k0s token list --role
+   worker`](../cli/k0s_token_list.md) to list all tokens and [`k0s token
+   invalidate <token-id>`](../cli/k0s_token_invalidate.md) to invalidate them immediately.
+
+[backup]: ../backup.md
+[join tokens]: ../k0s-multi-node.md#about-join-tokens
+
+## See also
+
+* [Install using custom CAs](../custom-ca.md)
diff --git a/mkdocs.yml b/mkdocs.yml
index 5af30767b0e5..0a41aebf5744 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -72,6 +72,7 @@ nav:
     - Logs: troubleshooting/logs.md
     - Common Pitfalls: troubleshooting/troubleshooting.md
     - Support Insights: troubleshooting/support-dump.md
+    - Certificate Authorities (CAs): troubleshooting/certificate-authorities.md
   - Reference:
     - Architecture: architecture/index.md
     - Command Line: cli/README.md