An upgrade to Jammy Jellyfish


The story of how I upgraded Ubuntu to 22.04 on the Nodes of a Kubernetes cluster

Introduction

I have a Kubernetes cluster that was built in a similar way to what I described in that post. Everything works as expected, but the operating system on the nodes is Ubuntu 20.04. Yes, 20.04 is an LTS release with 5 years of security maintenance, so it should keep working well for at least another year. But I thought it would be a good idea to upgrade at least to 22.04, which is also LTS and gives two extra years of updates, up to April 2027 according to the Ubuntu release cycle.

From my experience I remember that not everything goes smoothly when upgrading between releases. It has happened to me a couple of times, though I don't remember if it was Ubuntu. Bottom line: not everything went smoothly this time either, so I decided to write up some of the nuances in this post.

Plan

So I have a Kubernetes cluster that was built with kubeadm, with three control-plane nodes in it. These nodes are on VPS hosting. Besides my pet projects, such as this site, there's nothing critical on that cluster. So the plan was as follows:

  1. Make a backup of the data that could be useful if I break everything.
  2. Drain the first control-plane Node.
  3. Try to upgrade the host system on that node.
    If something goes wrong at that stage, I will try to find the issue and fix it.
  4. Check that everything is okay after a while.
    It is possible that something is already broken but pretends to work correctly for some time. If something goes wrong and it's not clear how to fix it, I immediately add a new control-plane node with fresh Ubuntu 22.04 to the cluster.
  5. Continue with the other nodes one by one.
    Or recreate the cluster and rebuild everything from the backups if the upgrade goes wrong.

In general there are two approaches:

1. Read all the documentation, find all the changes that might break something, and prepare for everything.

2. Start doing something and solve problems as they arise.

In any case, even with the first approach you should be ready for unexpected problems. So after making a plan, taking data backups, and checking them, I decided to move on to the upgrade.

All resources in the cluster are deployed via FluxCD, so GitOps is another thing that allowed me to be brave.

Upgrading Ubuntu

So I made a backup of the data from the cluster and checked that it was valid. There's also a recommendation to make a backup of the etcd cluster before any intervention on control-plane nodes. But I didn't do it, since there are two more control-plane nodes and they should tolerate the failure of a single control-plane node.
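
For reference, an etcd snapshot on a kubeadm-built control-plane node looks something like this. This is just a sketch with the default kubeadm certificate paths and a made-up output file name, not a command I actually ran:

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /root/etcd-before-jammy.db # hypothetical output path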

Then I drained the Node:

kubectl drain --ignore-daemonsets <NODE>
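
Depending on the workloads, drain can refuse to evict pods that use emptyDir volumes. In that case an extra flag is needed; adjust to your own situation:

kubectl drain --ignore-daemonsets --delete-emptydir-data <NODE>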

Then I connected to the drained Node via SSH and got a message inviting me to upgrade:

New release '22.04.3 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

But that message probably depends on the image customization of the VPS provider.

I tried to run that command, but it wouldn't work since updating packages was required first. Perhaps an upgrade of the Kubernetes cluster itself will also be necessary at some point, but there's a detailed page in the docs about that.
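
Updating the packages is the usual apt routine. A minimal sketch, assuming kubeadm, kubelet and kubectl were installed from the Kubernetes apt repository and pinned with apt-mark hold so the release upgrade doesn't touch them:

sudo apt-get update
sudo apt-get upgrade
apt-mark showhold # kubeadm, kubectl and kubelet should stay on hold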

After updating the packages I ran

do-release-upgrade

once more, and it was able to continue the process.

Then I got a message

This session appears to be running under ssh. It is not recommended
to perform a upgrade over ssh currently because in case of failure it
 is harder to recover.

If you continue, an additional ssh daemon will be started at port
'1022'.
Do you want to continue?

Continue [yN]

At that step it's probably a good idea to check the firewall and make sure that port 1022 is open. Maybe last time I forgot to do that.
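
A quick way to check, assuming a plain iptables setup on the host, is to ask whether an accepting rule for that port already exists:

sudo iptables -C INPUT -p tcp --dport 1022 -j ACCEPT && echo open || echo closed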

Type y and then press [ENTER].

The next message explains how to open the firewall:

To make recovery in case of failure easier, an additional sshd will
be started on port '1022'. If anything goes wrong with the running
ssh you can still connect to the additional one.
If you run a firewall, you may need to temporarily open this port. As
this is potentially dangerous it's not done automatically. You can
open the port with e.g.:
'iptables -I INPUT -p tcp --dport 1022 -j ACCEPT'

To continue please press [ENTER]

But if those firewall rules are managed by the CNI (Calico, for example, can do that), it's better to update them via a network policy. It's also possible that the hosting provider has its own firewall around your VPS, managed via its admin panel.

I pressed [ENTER]

Do you want to rewrite your 'sources.list' file anyway? If you choose
'Yes' here it will update all 'focal' to 'jammy' entries.
If you select 'No' the upgrade will cancel.

Continue [yN]

That seems to be what I want, so I chose yes.

Next

There are services installed on your system which need to be restarted when certain libraries, such as libpam, libc, and libssl, are upgraded.
...
Restart services during package upgrades without asking?

Here again yes.

Remove obsolete packages?

I'm okay with that, again yes.

System upgrade is complete.

Restart required

To finish the upgrade, a restart is required.
If you select 'y' the system will be restarted.

I chose yes, and the system was upgraded and restarted.

Checking

First I wanted to check that the Node was upgraded: connect to it via SSH and check the version

cat /etc/os-release

And I can see that Ubuntu was upgraded:

PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

The next step is checking that all nodes in the cluster are in Ready status.

kubectl get nodes

In the response I could see that all nodes appear in Ready status.

Next I wanted to check that all system pods had started correctly.

kubectl -n kube-system get pods -o wide

Some pods appeared in the CrashLoopBackOff state. After a few minutes some of them were still in CrashLoopBackOff, and waiting a little longer didn't seem to solve the issue.
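
To narrow the list down to the freshly upgraded node, the pod list can be filtered by node name; <NODE> is a placeholder here:

kubectl -n kube-system get pods -o wide --field-selector spec.nodeName=<NODE>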

The problem

I checked the kubelet journal

journalctl -u kubelet -f

And found a strange message, mainly strange because I hadn't seen a similar message on the other Nodes at that moment:

kubelet[847]: E0425 07:23:21.243398     847 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"engine-image-ei-ebe8de04\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=e

I tried searching the internet for this message. I found a few solutions, but didn't save the links to them. In short, one of them advised reinitializing the node, and the other advised not to use the systemd cgroup driver.

Well, I can always reinitialize a node, but I didn't want to give something up unnecessarily. So I decided to look through the official Kubernetes documentation using the keywords systemd and cgroup, and found the section Configuring the systemd cgroup driver. The words “The systemd cgroup driver is recommended if you use cgroup v2” immediately caught my eye.

Then I checked the old and new nodes, comparing their cgroup versions:

stat -fc %T /sys/fs/cgroup/
cgroup2fs # on new
tmpfs # on old

It seems the problem was related to a misconfiguration of containerd, and I just had to update its settings as described in the docs. In my case the containerd configuration is located at /etc/containerd/config.toml. I updated a single line to SystemdCgroup = true:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  ...
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
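
To double-check that the change is picked up and that kubelet uses the matching driver, something like this should work. A sketch assuming containerd with the CRI plugin and a kubeadm-generated kubelet config at the default path:

containerd config dump | grep SystemdCgroup # should print SystemdCgroup = true
grep cgroupDriver /var/lib/kubelet/config.yaml # kubeadm defaults to systemd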

After that I restarted containerd

systemctl restart containerd

Then I waited for a while to make sure that there were no unexpected restarts of pods.
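
Watching the restart counters for a few minutes makes that easier; plain kubectl is enough:

kubectl -n kube-system get pods -o wide -w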

Then I allowed non-system pods to run on the Node again:

kubectl uncordon <NODE>
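
After uncordoning, the SchedulingDisabled mark should disappear from the node status:

kubectl get node <NODE> # STATUS should be Ready, not Ready,SchedulingDisabled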

Conclusion

After repeating the described procedure on all the nodes, the cluster continues to work as before. In the end it turned out that there was no problem with the Ubuntu upgrade itself, just a small difficulty with cgroups, and it did not affect the operation of the cluster.

The advice I found on the internet for this problem also seemed unusual. It's good that I decided to look in the official documentation; otherwise I would most likely have taken the sub-optimal route of re-initializing the Nodes. It seems that even people on the internet should not be taken at their word, not to mention AI helpers.