Sublime Ads - Downtime report (2022-11-13)
So sorry for the unexpected downtime today (I hate downtime!). Long story short, here is the series of events that caused this, and how I fixed it:
- As part of the general application framework updates today, I also wanted to resize the server instance. It was running on a much larger server than I needed. Deployments used to be slow because assets had to compile on deploy, so the larger server was there to accommodate that. However, the framework upgrades make this redundant... as I am moving away from Webpacker and upgrading the project to use Propshaft and import maps (at a later stage).
- I resized the server without issue. However, immediately on boot the application threw an internal server error.
- I tried a re-deploy of the application, to see if something had gone amiss. No dice.
- Investigation showed that the internal server network had somehow corrupted both the Redis and Postgres connection strings, so neither was connecting as intended. The network was set up by the service provider that provisions the servers on my behalf (I hate server management). Unfortunately for me, out-of-hours support was not available to help me try and resolve this.
- I went to my server provider, deleted the internal network interface, and re-created it. The server provisioning provider picked up on this, and told me to re-deploy so it could re-create the connection strings and other variables for the changed network.
- No-go. All still failing.
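For the broken connection strings above, a quick reachability check is one way to narrow things down. This is only a minimal sketch in Python, not what I actually ran: the function names are made up for illustration, and it assumes the connection strings are ordinary `redis://` or `postgres://` URLs. It only checks that a TCP connection can be opened, not that authentication works.

```python
import socket
from urllib.parse import urlsplit

# Default ports, used when the URL does not spell one out.
DEFAULT_PORTS = {"redis": 6379, "rediss": 6379, "postgres": 5432, "postgresql": 5432}


def host_and_port(url):
    """Pull the host and port out of a connection URL (hypothetical helper)."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if not parts.hostname or port is None:
        raise ValueError(f"cannot determine host/port from {parts.scheme!r} URL")
    return parts.hostname, port


def is_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port can be opened."""
    host, port = host_and_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Running `is_reachable(...)` against each connection string from the app's environment would at least show whether the internal network is the problem, or something further up the stack.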
🛏️📖 Bed and story time for my daughter.
Stressed me, returning after 45 minutes:
- I have a fail-over group that should allow me to fail over to another server. Unfortunately for me, the internal network was screwed.
- I then tried to spin up a new server, with a clone of the application, so that I could point the "failover" to a new primary server.
- That did not work as it couldn't find the network ID it created initially on the server provider (which I deleted in the step above). Damn!
- Enabled "maintenance mode", which shows a placeholder message.
At this stage, I downloaded the latest database backup, which runs every hour and is available to me at any time I so require. I didn't want to take any chances. Of course I couldn't do a backup right there and then, because the network was... fucked. Even SSH'ing wasn't playing ball. Thankfully the last backup was recent.
Where was I?
- Spun up a new application, cloning my existing application settings, as a new project. This takes time! A long time, as it has to provision new servers and everything else. Oh the joys of waiting.
- Whilst waiting 20 minutes for a brand spanking new server to provision, I decided to also provision the application on another server of mine, one set up the new way I want to do things, which already runs shoutouts.lol. Of course this is a bit more involved, like copying across all the secrets, SSL certs and all that good stuff.
- After 20 mins of waiting for the other server, it told me it failed. OK, so try again I guess, which starts the whole process again. Another wait.
- Back to my nice server, everything seems to have been deployed, but got one failed attempt as I hadn't given it access to the encryption keys for the database used for Sublime Ads (I hate the fact I did it like this in the first place...). OK, added, deploy again.
- I set up a new hostname to point to the "nice" server. The application was up and running, and it was working. Database backup restore next...
- The other server was still provisioning.
- I wanted to restore the database using my GUI client, TablePlus, but it didn't want to play ball, so I SSH'd into the server and just did it via the CLI. Super easy once I figured out that I can give psql a full connection URL for authentication!
- Another notification from my other server saying deploys have failed... urgh. Complaining about encryption. At least I know what that was about. It needed the keys to encrypt and decrypt the database.
- I hit up the new subdomain of my "nice" server that is serving my application, and I can see good things. It seems to be up and running with the data intact.
- I run a few manual tests, like account creation, ad creation and so on, just to make sure it's not running into any issues with encryption (I had that once). But everything was A-OK.
- I give up on the screwed servers.
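On the CLI restore mentioned in the list above: the trick is that `psql` accepts a full connection URL instead of separate host, user and database flags. Here is a rough sketch of that step wrapped in Python. It assumes the backup is a plain SQL dump (a custom-format dump would need `pg_restore` instead), and the function names, URL and file name are placeholders, not my real setup.

```python
import subprocess


def build_restore_command(database_url, dump_path):
    """Build the psql invocation for restoring a plain SQL dump (hypothetical helper)."""
    return [
        "psql",
        "--set", "ON_ERROR_STOP=1",  # stop at the first failing statement
        "--file", dump_path,         # read SQL from the dump file
        "--dbname", database_url,    # --dbname also accepts a full connection URL
    ]


def restore(database_url, dump_path):
    """Run the restore; raises CalledProcessError if psql exits non-zero."""
    subprocess.run(build_restore_command(database_url, dump_path), check=True)
```

Something like `restore("postgres://user:password@host:5432/dbname", "backup.sql")` would then do the whole thing in one go. Passing the arguments as a list, rather than one shell string, avoids quoting problems with special characters in the password.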
This is where I am happy I chose Cloudflare as my DNS host. Changing A records would have caused even more downtime than I wanted (the root domain had none; see the next sentence), but the subdomain was already resolving nicely. The problem was that the root domain was pointing at a fail-over group CNAME, which meant the fail-over group could direct traffic as needed without any DNS changes. Deleting the CNAME was not an option for me, as that would certainly have caused even more disruption.
So, really long story short, I pointed the CNAME to the subdomain... and BOOM, Sublime Ads was up and running again.
Cloudflare proxies requests, so this was an instant switch.
What have I learned? Don't do this close to your kid going to bed.
And the second lesson is, if your server provisioning provider makes stuff really complicated, don't go for them because it's a huge black box...
Now Sublime Ads runs with Puma behind Caddy, and I am happy. Not because it works, but because it's easier for me to manage.
Moving to the "nice" servers was a change I had planned for some stage, just not today... I certainly would have preferred to do it without the downtime... but here we are.
Whilst the app was down, there might have been a few missed tracking events, like Views and Taps. I am making plans to make this more robust and move all this to the edge network, which will queue up anything that could not hit the main servers.
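To give an idea of the "queue up anything that could not hit the main servers" part, here is a minimal sketch in Python. It is not the planned implementation (nothing is built yet, and the real thing would live on the edge network): the `EventBuffer` class, the `send` callback and the event shape are all made up for illustration.

```python
from collections import deque


class EventBuffer:
    """Hold tracking events while the main server is unreachable (hypothetical)."""

    def __init__(self, send, max_size=10_000):
        # `send` is any callable that delivers one event and raises
        # ConnectionError when the main server cannot be reached.
        self._send = send
        # Bounded queue: once full, the oldest pending events are dropped.
        self._pending = deque(maxlen=max_size)

    def track(self, event):
        """Queue the event, then try to deliver everything that is pending."""
        self._pending.append(event)
        self.flush()

    def flush(self):
        """Deliver pending events in order; return how many got through."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # still down, keep the rest for the next attempt
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)
```

An event is only removed from the queue after `send` succeeds, so an outage like today's would delay Views and Taps rather than lose them (up to `max_size` of them, anyway).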