An unplanned business continuity test, and what I learned from it

I did something stupid this week. I should probably hang my head about this, because, frankly, I shouldn’t have got myself into this mess in the first place.

But you know what? I’m not hanging my head.

As this unexpected exercise showed, I must have got something right in terms of the bigger picture, because recovering from my own cock-up was not just successful but quick, easy, and stress-free.

Yes, I’ve learned some lessons. But I’m pleased that the backups and continuity steps I have in place mean that, in essence, I can do stupid things, within limits. I can fail, and not be too badly off for it.

And I’ve learned from it anyway.

What I got wrong

tl;dr: I locked myself out of a work system after a system update, with no obvious way of getting back in.

I have a program of work to move our servers from Debian Buster over to Debian Bullseye, as Buster is getting EOL’d later this year. I wanted to do it well in advance of that, to be on the safe side. And I’ve been using Bullseye for a while now.
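
For context, the upgrade itself is nothing exotic: broadly the standard in-place Debian route, which the Bullseye release notes cover in full. A simplified sketch (the exact sources.list.d paths vary a little from system to system):

    # Point apt at bullseye instead of buster
    sudo sed -i 's/buster/bullseye/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list

    # Refresh the package lists and upgrade in two stages
    sudo apt update
    sudo apt upgrade --without-new-pkgs
    sudo apt full-upgrade

    # Tidy up and reboot into the new release
    sudo apt autoremove
    sudo reboot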

I’d done a couple of test upgrades, and they worked just fine.

I’d done a couple of real upgrades, and they worked just fine too, tested over two or three months.

So, when I came to do the upgrade on this particular machine, I was not expecting too many problems. There might be bits to sort out - there nearly always are - but nothing too fundamental.

What I had not tested - my bad - was the upgrade on a similar or identical system: a Raspberry Pi running Raspberry Pi OS, with LUKS encryption and unlocking over SSH.

So, when I did the upgrade and rebooted the Pi, I could ssh in to the pre-boot environment, and I could see the prompt to enter the LUKS decryption passphrase (askpass), but the passphrase was not accepted.
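
For anyone who has not seen this kind of setup: the initramfs runs a small SSH server (dropbear, in the usual arrangement), and a normal post-reboot unlock goes roughly as below - depending on the configuration, the passphrase prompt may instead appear automatically on login. The hostname is illustrative:

    # From another machine, ssh into the Pi's pre-boot environment...
    ssh root@pi.example.internal

    # ...and hand cryptsetup the passphrase at the initramfs shell
    cryptroot-unlock
    # if the passphrase is accepted, the root filesystem is unlocked and the
    # boot carries on as normal; this time, it was not accepted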

I am assuming that it is an initramfs problem, but I don’t know exactly what is wrong. It’s not a LUKS problem, because I can unlock and mount the LUKS partition manually. And it’s not an ssh issue, because I can ssh into the pre-boot environment.
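
Checking that was easy enough: with the card in a USB reader on another machine, the same passphrase opens the partition happily (the device and mapper names here are illustrative):

    # Open the LUKS container on the card's root partition
    sudo cryptsetup open /dev/sdb2 sdcrypt
    # (prompts for the passphrase, and accepts it)

    # Mount it and confirm the data is all there
    sudo mount /dev/mapper/sdcrypt /mnt
    ls /mnt

    # Put everything back as it was
    sudo umount /mnt
    sudo cryptsetup close sdcrypt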

But I can’t unlock the LUKS partition while the microSD card is still in the Pi, to enable the machine to boot from the unlocked partition. Aaargh.

I spent a fair amount of time learning my way around unpacking and rebuilding an initramfs, to try and sort it out, without luck. This is still bugging me, as I don’t know exactly what I’ve got wrong or how to avoid it in the future. More Pi-based testing needed.
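
For the record, the poking around looked roughly like this - unpack the image to see what is inside it, then rebuild it from a chroot into the card's own root filesystem on another Pi (paths and device names illustrative) - none of which fixed it:

    # Unpack the initramfs image for inspection (unmkinitramfs ships with initramfs-tools)
    mkdir /tmp/initrd
    unmkinitramfs /path/to/boot/initrd.img /tmp/initrd
    ls -R /tmp/initrd | less

    # To rebuild: unlock and mount the card's root filesystem on another Pi,
    # mount the card's boot partition inside it, then regenerate from a chroot
    sudo mount /dev/mapper/sdcrypt /mnt
    sudo mount /dev/sdb1 /mnt/boot
    for fs in dev proc sys; do sudo mount --bind /$fs /mnt/$fs; done
    sudo chroot /mnt update-initramfs -u -k all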

So I was left with a Pi which would not boot from its microSD card, but with all the data intact on that microSD card, still secured in the LUKS-encrypted partition.

Not a terrible position to be in - no data loss, and no impact on my work - but it wasn’t working either.

What I got right

Tried and tested backups

Every time I deploy a new system, I write and test backup scripts, and test restoration of the backup, before I “go live” with it. Pretty much for exactly this situation: I want to know that I can recover from my own errors, or from a microSD card corruption, or from something borking my servers, or from someone stealing the server.
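
The scripts themselves are nothing clever. A stripped-down illustration of the sort of thing I mean - not the actual script, and the hostname and paths are made up:

    #!/bin/sh
    # Illustrative backup sketch: copy the directories that matter to a separate box
    set -eu

    BACKUP_HOST="backup.example.internal"   # made-up name
    DEST="/srv/backups/$(hostname)"

    for dir in /etc /home /var/lib/someapp; do   # someapp is a placeholder
        rsync -a --delete --relative "$dir" "${BACKUP_HOST}:${DEST}/"
    done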

That was a good position to be in. I could have mounted the microSD card and extracted the data from it (it was cleanly shut down, so there should not have been any corruption issues either).

But it was easier and faster just to put a new image on a microSD card, and restore from the backups.
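
In practice that meant writing a fresh Raspberry Pi OS image to a spare card, redoing the encryption and SSH-unlock setup, and then pulling the data back down from the backup host. Roughly, with illustrative names and devices:

    # Write a fresh image to a spare card - double-check the device name first,
    # as dd will happily overwrite the wrong disk
    sudo dd if=raspios-bullseye-arm64.img of=/dev/sdb bs=4M conv=fsync status=progress

    # ...base configuration and the LUKS / SSH-unlock setup redone on the new card...

    # Then pull the data back from the backup host into a staging area
    rsync -a backup.example.internal:/srv/backups/oldpi/ /srv/restore/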

Leaving aside the time I spent tinkering with the initramfs - that was yesterday evening, and only because I wanted to see whether I could rebuild and fix it - the time to fix was about 30 minutes.

(This is also why it makes good business sense for me to have two laptops, kept in sync: if one breaks, I can pick up the other, and keep going, with minimal downtime.)

An annual business continuity test

Every year - occasionally twice a year, work depending - I pick a system, and I break it (or at least pretend to break it; I don’t actually need to break a Raspberry Pi to pretend it is broken). More often, I simulate power outages, or fibre breaks / problems.

In an ideal world, I’d break more systems on a more regular basis, but hey ho.

Doing this regularly means that recovery is a practised routine, not something I am working out for the first time in the middle of a real problem.

What I will do differently

Frankly, I shouldn’t have been in this position, and I am kicking myself. But I can take some useful learnings from it:

I should have done a test on a mirror system

I relied too much on positive experience of doing the same thing on different systems.

Sure, knowledge can be transferable, but I am not short of Raspberry Pi machines or SD cards, and it would have been easy enough to do a test run on a test system, where a failure would have carried zero risk.

Creating an identical system would have been straightforward: I could have cloned the microSD card.

I should have cloned the microSD card first

Since I am hosting most things on Raspberry Pi machines, what I should have done, for a major update, was to shut the machine down, clone the microSD card, and do the upgrade on the new image.

That way, if the upgrade had gone as badly wrong as this one did, I could simply have switched SD cards back to the one I knew was working, and everything would have carried on as normal.
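
Cloning a card is hardly any work, either: with it in a USB reader on another machine, it is a pair of dd commands (device names illustrative, and worth triple-checking before pressing enter):

    # Read the whole card into an image file...
    sudo dd if=/dev/sdb of=pi-pre-upgrade.img bs=4M conv=fsync status=progress

    # ...and write that image out to a second card of the same size or larger
    sudo dd if=pi-pre-upgrade.img of=/dev/sdc bs=4M conv=fsync status=progress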

Irritatingly, had it been a VM, I would have created a snapshot first, so I could roll back.

More fool me, but a useful learning.

I should have included my backup script in my backups

One thing I had not backed up was my backup script.

Not a big deal in this case, as it is not a complex script and I have similar scripts on other machines on which I could draw. And, yes, I tested that it worked, and that I could restore from it.

It would have been easier if I’d backed up my own backup script, and the associated cron entry.
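
That is a small fix: have the script dump the crontab and copy its own directory along with everything else. Something like this, reusing the made-up names from the earlier sketch:

    # Inside the backup script: dump the crontab so the schedule is recoverable too
    crontab -l > /home/pi/backup/crontab.txt

    # And include the script's own directory in what gets copied off the box
    rsync -a /home/pi/backup/ "${BACKUP_HOST}:${DEST}/backup-scripts/"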

Should I have constantly-mirrored systems, so I can swap immediately from one to another?

I’m not sure. That’s twice the power, twice the maintenance (even if this is mostly automated), twice the number of IP addresses, and so on. But perhaps it’s worth doing.