Learning Kubernetes
Moving From Docker into Kubernetes
A few years ago I spent a few days just playing with and reading up on Docker, understanding how it properly fit together. I'd used it for a couple of years in the context of Unraid, and had cobbled together custom virtual networks and weird hacked-together script containers to try to get everything to work correctly. The Unraid interface hid most of the complexity away, but also most of the power. It seemed to work though, and I honestly have no idea how it worked so well for so long. Looking back at it, there's no way it should have.
It then became an excellent solution for building our games (and for some other tools & environments) at work - we had just moved to GitLab, and wanted CI with the build tools baked into an image. Which meant I had to learn it properly enough to use in a professional environment. While I'm alright with using some hacky things where needed, when it comes to something as important as live builds I want it done as "properly" as I can.
I felt like I already had a good grasp on how Docker worked, so I pretty quickly felt comfortable with it. I soon had a bunch of nice images set up in the work repos, and the tooling was all auto-building and connecting to AWS. The whole flow was pretty smooth.
One thing I really could not understand, however, was the need for Kubernetes in anything smaller than a massive web service, or anywhere uptime really matters. For us, a little server running our maths engine that got hit a few hundred times a day for test purposes was handled by a docker-compose file being copied over to an EC2 server and "upped". It recovered itself if a container died (although uptime was over half a year), and with a couple of script commands from the CI it would update happily. There was no need for anything like k8s.
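To give an idea of how simple that old setup was, the "deployment" from CI was essentially just the sketch below (host name, paths and credentials are placeholders rather than our real config), with the restart policy in the compose file doing the self-recovery:

```bash
# Roughly the shape of the old CI deploy step -- host and paths are placeholders.
# The compose file's restart policy is what brought containers back if they died.
scp docker-compose.yml deploy@maths-box:/srv/maths/
ssh deploy@maths-box 'cd /srv/maths && docker-compose pull && docker-compose up -d'
```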
Starting off with documentation
However, we then got a bunch of old server machines in a rack from a client at work. So with that in mind I thought, well, screw it, I'll go to the effort of learning k8s. The idea of our maths runner being able to connect to a database instead of spitting out a local file, and of us running billions of runs across a load of high-core machines in a day with the data saved in a centralised place, was a pretty nice one. I also like the challenge of cool little projects like this. Plus the knowledge would then let us run some GitLab runners to build every commit of our assets, as well as potentially off-load our retail builds.
Finding out exactly what Kubernetes is was a challenge by itself. I understood it as a system that basically spreads containers across servers and looks after them (ensures they're up, rolls out updates and keeps the data safe) - effectively a fancy docker-compose. However, multiple chats with a friend who works in DevOps really confused me as to what it was for: saying my understanding wasn't correct and that it was an "orchestration system", and overcomplicating things with needing Helm, or needing to use Google's GKE or Amazon's EKS before trying it out. The overwhelming amount of "needed" knowledge really held me back for a while. Until I thought, well, screw it, I'll just install it on a couple of servers and see what happens - they're not doing anything and I have the time.
Going straight to the official learn-Kubernetes pages seemed like a good idea. There's a bunch of interactive tutorials with loads of information. Turns out they teach you very little ("do step 1, 2, 3 and now you know it" is a terrible teaching method - there's no why, or what you're actually doing; you need to already know that somehow), and there's a massive information overload for someone like me who roughly understands Docker and that's about it. I'm not really a network person. As a starter, I honestly just wanted to know how to actually install it, connect a couple of machines together and run a container on the system.
Giving up on the Kubernetes documentation as beyond my current ability, I watched a couple of videos that ended up having the same problems - it's all well and good showing how to run a webserver locally, but when the next step is "now deploy it across GKE/EKS", that just isn't helpful for me; it doesn't explain how things connect together. No one seems to care about bare metal anymore. That's the opposite problem to the one the Kubernetes documentation had.
I ended up finding MicroK8s, which had some very simple install instructions and very few, descriptive commands, and seemed like a great starting point. So with that, I rebuilt four of the rack servers with Ubuntu Server, set one up as a "base" machine, SSH'd into them all and connected them. Within a few minutes I had the Kubernetes dashboard up with four nodes connected. Dunno why everyone makes this stuff so complicated.
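For reference, the whole bring-up was roughly the following (from memory, so treat it as a sketch rather than a copy-paste guide - the IP and token are placeholders):

```bash
# On every node: install MicroK8s from the snap.
sudo snap install microk8s --classic

# On the "base" node: generate a join command for each new node.
sudo microk8s add-node            # prints a `microk8s join <ip>:25000/<token>` line

# On each of the other nodes: run the printed join command.
sudo microk8s join 192.168.1.10:25000/<token>   # placeholder address/token

# Back on the base node: enable cluster DNS and the dashboard, then check.
sudo microk8s enable dns dashboard
microk8s kubectl get nodes
```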
"Hello World" on Kubernetes
I went through quite a few different setups with MicroK8s, testing things out and resetting frequently. I never think that's a bad thing, so long as you learn something each time. The first "app" I decided to set up was a stats server & data store. This meant I could try out something I expected to be pretty easy (Jobs), and something I thought would be a little harder (anything with storage - a StatefulSet).
My first step was to actually make the stats engine we have internally output to a database. I booted up an instance of MongoDB with Docker (my database of choice for this, purely because of simplicity - our data is already JSON formatted and databases aren't my strongest area), re-wired our output from a file into a better format and threw it at the database. I had it working within the hour, cleaned up and production ready (internal production, so a little looser) - fantastic, I thought. The engine was already built into a Docker image, as that's how we already run it, and it already had the idea of a "stats job": do a bunch of runs, stick out the data.
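The local test database really was just the stock image in Docker, something like this (credentials and version tag are illustrative, not what we actually use):

```bash
# Throwaway MongoDB instance for wiring up the stats output locally.
docker run -d --name stats-mongo -p 27017:27017 \
  -e MONGO_INITDB_ROOT_USERNAME=stats \
  -e MONGO_INITDB_ROOT_PASSWORD=change-me \
  mongo:4.4
```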
With this ready to be deployed as a Job, my next task was to run MongoDB on the server. It didn't need to be anything special, and it was so easy to create a container locally that just worked that I thought this would be a doddle. Turns out it gets complicated very fast. Their documentation was, again, not great unless you're already confident with the tool or using it as a reference. They only seem to recommend using their operator and managing the system through that, and their "how-to" and basics documentation for it just links to a blog post from a few years ago, which makes it particularly hard to follow - after an attempt or two, this idea got sacked off. I then found the Helm charts by Bitnami, which were simple, concise and basically "here's how to run a basic instance, here's the config values and what they do" - perfect. I had it running pretty quickly, with the commands and configs in a shiny new repo. Easy to reproduce if it all collapses, easy to see what the config is, easy to explain to people.
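The Bitnami route boils down to something like this - the release name, namespace and values below are illustrative rather than our actual config, but it shows why I liked it: add the repo, install the chart, keep the values alongside it in a repo:

```bash
# Standalone MongoDB via the Bitnami chart. Values here are examples, not our
# real settings; the full list is documented with the chart itself.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install stats-mongo bitnami/mongodb \
  --namespace stats --create-namespace \
  --set architecture=standalone \
  --set auth.rootPassword=change-me \
  --set persistence.size=20Gi
```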
Running the stats job was super easy. Running the job across four machines, then pulling the data out of the database, all just worked. The networking was easy enough, although I did have some real problems with hostname lookups, which refused to work - I ended up injecting the hostnames into every instance, which did the trick. The hardest part of my "first Kubernetes app" by far, though, was the MongoDB setup.
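The job itself was nothing special - something along these lines, where the image, counts, addresses and hostnames are all placeholders, and hostAliases (which pushes extra /etc/hosts entries into each pod) is one way of doing that hostname injection; I'm not claiming this is exactly what we shipped:

```bash
# Sketch of a fan-out stats Job with injected host entries.
cat <<'EOF' | microk8s kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: stats-run
spec:
  completions: 4
  parallelism: 4
  backoffLimit: 2
  template:
    spec:
      hostAliases:
        - ip: "10.0.0.50"                  # placeholder database address
          hostnames: ["mongodb.internal"]  # placeholder hostname
      containers:
        - name: stats-engine
          image: registry.example.com/stats-engine:latest  # placeholder image
          env:
            - name: MONGO_URL
              value: "mongodb://mongodb.internal:27017/stats"
      restartPolicy: Never
EOF
```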
Free runner minutes
I decided my next task was to set up GitLab runners. With in-house runners that would expand as needed, we could happily build every commit of our assets. The assets build step takes by far the most time in our game builds; however, towards the end of a project especially, there just aren't that many commits to our assets repos. If the assets are pre-built, our game builds can take a quarter of the total time. And if game builds are quick and easy, developers aren't waiting around as long to check their build works, and the whole flow comes together much better.
Connecting the instance to GitLab was easy enough - setting up the keys was fine, basically just following their how-to. The bit that tripped me up next was their "one click" k8s applications. There's a big old install button next to all sorts of things you might want, like the GitLab runners, Prometheus and a bunch of other fun applications I don't really understand (yet). Of course, what they don't tell you on that page is that those buttons install these applications with a default Helm config that doesn't really work, or is full of deprecated options. Cool. So after fiddling with that a bunch, I gave up on it too and set up the "Cluster Management Project" instead. This was basically just a super nice little config repo that GitLab hooks into: fiddle with the config, add a flag, commit, and you get a fully set up and installed application that links to GitLab.
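If I remember the template layout right, the runner part of that repo comes down to a small values file for the gitlab-runner Helm chart - the URL, token and tags below are placeholders:

```bash
# Values for the gitlab-runner chart, kept in the cluster management project.
# Commit a change and GitLab's pipeline re-runs Helm for you.
cat > applications/gitlab-runner/values.yaml <<'EOF'
gitlabUrl: https://gitlab.example.com/    # placeholder; our self-hosted instance
runnerRegistrationToken: "REDACTED"       # from the project/group runner settings
concurrent: 4
runners:
  tags: "k8s,assets"
  privileged: true    # only needed if jobs themselves build images
EOF
```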
Those runners then hooked up easily and would pick up jobs fine. They would, however, fail every job due to having no cache - so, onto caching.
Minio caching
GitLab runners support a few types of caching: Google Cloud Storage, AWS S3 buckets, I think there's an Azure one, and local - which doesn't make sense here, since multiple machines need a central source. The best way to have it "locally" seemed to be to set up a MinIO server, which basically emulates AWS's S3. Perfect - we already use S3 and I'd had a fair amount of AWS exposure at this point, so using that API seemed great.
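On the runner side, the cache config is just S3-style settings pointed at the MinIO service. A sketch, assuming the gitlab-runner chart's runners.config value (newer chart versions expose the raw runner TOML this way) and a made-up service address and credentials:

```bash
# Extra Helm values telling the runners to cache into an in-cluster MinIO.
cat > runner-cache-values.yaml <<'EOF'
runners:
  config: |
    [[runners]]
      [runners.cache]
        Type = "s3"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "minio.minio-tenant.svc.cluster.local:9000"  # assumed service DNS
          BucketName = "runner-cache"
          AccessKey = "CHANGE_ME"
          SecretKey = "CHANGE_ME"
          Insecure = true   # plain HTTP inside the cluster
EOF
helm upgrade --install gitlab-runner gitlab/gitlab-runner -f runner-cache-values.yaml
```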
Setting up MinIO was awful. For k8s they have an operator, so you should just have to install the operator (through yet another third-party installer - krew, which installs kubectl plugins), run the commands through that, and it should "just work". So, after setting up krew and getting the MinIO operator installed, I went through their docs and set up the config as I thought we needed it. It wasn't quite right, so it rightfully failed to launch correctly - that's fair, my fault. But then I had to go through and manually undo EVERY step the operator had taken, or the operator would refuse to install again. After quite a few attempts I ended up with the correct set of commands and got the operator set up in the namespace I wanted, with the storage I wanted across multiple machines.
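That eventual set of commands was roughly this - tenant name, namespace and sizes are stand-ins, and the exact flags may differ between plugin versions:

```bash
# krew installs the kubectl-minio plugin; the plugin installs the operator;
# the tenant create call lays the storage out across the nodes.
kubectl krew install minio
kubectl minio init
kubectl minio tenant create runner-store \
  --servers 4 --volumes 8 --capacity 1Ti \
  --namespace minio-tenant
```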
The operator, however, installed an old version of the actual MinIO server, which meant it couldn't actually be used (I can't remember exactly why, but the old version was super out of date and couldn't pick up part of the k8s config). Fine, I can change deployments and replica sets and all that manually, I'll just do that, set the image and be done with it. The operator changed it back. So at this point I was fighting a system that would detect my changes and undo them. I tried everything I could think of, to no avail. It was an open and known issue on GitHub though, with the reply being "we'll sort it next release, coming soon" posted over a month beforehand. I ended up basically just starting an instance, waiting for it to try to update itself and restart, then force-stopping the old version - k8s then wouldn't close the new server because the old one had already been removed. It was a hacky workaround that was pretty much just tricking k8s into keeping the one I wanted alive, and it definitely wasn't reproducible. Should I have left that running? No, definitely not. Did I care at this point, for a small part of a system that was internal to dev only and didn't matter that much if it failed? No.
So MinIO sort of worked at this point - at least I could access it through the web UI and set up a bucket. The way it's set up, though, it doesn't authenticate its SSL certs against the k8s certificate authority, so it didn't work with the GitLab runners. After following the recommended fixes (copying the certs to a shared space, re-authing the certs, etc.), it still refused to work correctly. But the host could be seen from the runner even if the caching didn't work, which meant the runner would look it up, fail to execute the caching, and still continue the job. Not great, but it'll do for now, and I'll come back when the tools are more mature.
Final Thoughts
At this point I had everything I wanted running, and didn't want to spend more time on this. Overall, my final thoughts are that Kubernetes is worth learning. Taken in small steps, with a clear, small project in mind and limiting the scope of what you need to do at any one time, it can be easy to get to grips with and is worth the effort. The core concepts are relatively simple - it's the tooling and everything built on top of it that adds the complexity.