Sometimes it can be hard to know what to do when everything is going wrong.
It helps to remember what people actually care about:
- An apology for the symptom
- An explanation of what happened (the root cause)
- The steps you're taking to stop it happening again
This super simple template is one of my go-to tools for when things go wrong. It helps calm people down and moves them from angry or upset to calm and informed.
When things go wrong, depending on how long it takes to fix, I'll send affected parties something like the following:
- Something's broken!
Widget A is broken.
The first report we got was 15 minutes ago from a customer (Acme Inc.). The Widget was updated as part of our new release yesterday, so it's possible there's a connection.
I'll keep you posted on the status of the Widget and our potential fixes.
- Fixed it!
Happy to tell you that the Dev team have successfully found and fixed the cause of the issue. From first report, it took 32 minutes to have a fix live and confirmed by the customer.
The problem was caused by a flaky connection to one of the databases, causing messages relating to Widget A, Widget D, and Widget F to be delivered only 30% of the time.
We've enabled a temporary fix in the code which stores failed queries when the database connection is down, so all messages arrive, even if things are broken at the time the Widget is used.
Moving forward, we're adding tests that should help us detect this kind of issue, along with improving our auto-reconnect and write-cache code.
Thanks for your patience,
- Latest Go-live
Some of you might remember that last week we had some issues with Widget A, D, and F.
At the time, we implemented a temporary fix. At 1300hrs today we'll be releasing the permanent fix. We'd all appreciate it if you could have a play to ensure that we've totally fixed the issue.
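Stepping back from the example messages: if you send these updates often, the three-part template (apology, root cause, prevention steps) can be captured in a small helper. This is only a minimal sketch in Python. The function name `incident_update`, its parameters, and the wording of each line are illustrative assumptions, not part of any existing tool.

```python
def incident_update(symptom: str, root_cause: str, prevention: list[str]) -> str:
    """Build a three-part update: apology, root cause, prevention steps.

    All names and phrasing here are hypothetical; adapt them to your own voice.
    """
    lines = [
        f"Sorry about this: {symptom}",
        "",
        f"What happened: {root_cause}",
        "",
        "What we're doing to stop it happening again:",
    ]
    # One bullet per prevention step, in the order given.
    lines.extend(f"- {step}" for step in prevention)
    return "\n".join(lines)


# Example usage with details borrowed from the messages above.
message = incident_update(
    symptom="Widget A is broken.",
    root_cause="a flaky connection to one of the databases",
    prevention=["adding tests", "improving auto-reconnect"],
)
print(message)
```

Keeping the three parts as separate arguments makes it harder to send an update that skips one of them.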
If this is something you do, or you have a different technique, let me know on Twitter: @jreeve0