Notification System Down
Incident Report for Sociohouse
Postmortem

Notification System Down: Postmortem

When Sociohouse was still a new project with a far smaller audience, we had an opportunity to test out different server mechanisms so that our main instance wasn’t under pressure. So we decided to move the notification system onto its own server and try out this new mechanism. The day after we implemented it, two things happened: first, instant regret, and second, 43 new users joined that day, the most we’ve ever seen (thank you!).

Since then, we’ve seen about 12 suggestions every single day in our inboxes: tweaks, bugs, new features, etc. Our backend engineers have been busy implementing other features, so we kinda left the notification system untouched. Until yesterday.

Our backend engineers saw about 15 new bug reports complaining that notifications weren’t working, and there it was: our notification servers were running, but a massive amount of data was being lost in transmission from one server to another, which kinda sucks. Since we’re only a team of 4 at the moment, with 1 backend engineer, the whole team rushed into deploying another instance, trying to fix the issue, etc.
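A check like the following could flag this kind of failure before the bug reports do. It is a minimal sketch, not the code running in production: it assumes the queue stats come from something like amqplib's channel.checkQueue(), which replies with the queue name, messageCount, and consumerCount. The backlog limit, the queue name, and the assessQueue function are hypothetical.

```typescript
// Shape of the reply from amqplib's channel.checkQueue(). Assumed here,
// since the post does not say which client library is in use.
interface QueueStats {
  queue: string;
  messageCount: number;
  consumerCount: number;
}

type QueueHealth = "ok" | "no-consumer" | "backlog";

// Hypothetical threshold: how many waiting messages count as a backlog.
const BACKLOG_LIMIT = 1000;

// Classify a queue from its stats. A queue with zero consumers is the
// "nothing is running on the receiving end" case, so it is checked first.
function assessQueue(
  stats: QueueStats,
  backlogLimit: number = BACKLOG_LIMIT
): QueueHealth {
  if (stats.consumerCount === 0) {
    return "no-consumer";
  }
  if (stats.messageCount > backlogLimit) {
    return "backlog";
  }
  return "ok";
}
```

Run on a timer against the notification queue, any result other than "ok" would be a reason to alert the team or restart the consumer.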

That didn’t work. Then one of the team members redeployed the code on the same instance without changing anything, and the system started working again. What we think happened is that our RabbitMQ queue, which was running on a node instance, wasn’t running on the receiving end. I believe this is because that code had been left untouched for so long. We added an if statement that prevents the RabbitMQ queue from shutting off randomly.
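The post doesn't say what that if statement checks, so here is one way such a guard could look. It is a sketch under assumptions: the consumer is a Node.js process, and the ConsumerGuard class and the startConsumer callback are hypothetical stand-ins for whatever actually opens the channel and subscribes to the queue.

```typescript
// Hypothetical guard around the consumer. `startConsumer` stands in for the
// function that opens the channel and subscribes to the queue.
class ConsumerGuard {
  private running = false;
  private starts = 0;
  private readonly startConsumer: () => void;

  constructor(startConsumer: () => void) {
    this.startConsumer = startConsumer;
  }

  // The "if statement": only start the consumer when it is not already running.
  // Returns true when a (re)start actually happened.
  ensureRunning(): boolean {
    if (this.running) {
      return false;
    }
    this.startConsumer(); // if this throws, `running` stays false
    this.running = true;
    this.starts += 1;
    return true;
  }

  // Call this when the connection or channel reports that it has closed.
  markStopped(): void {
    this.running = false;
  }

  // How many times the consumer has been started so far.
  get startCount(): number {
    return this.starts;
  }
}
```

The idea is to call ensureRunning() on startup and again after every close event, so a consumer that stops quietly gets started again instead of waiting for a manual redeploy.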

That was one of the biggest technical oopses of the month, and that’s what we believe the whole downtime came down to.

Posted Aug 16, 2021 - 12:21 AEST

Resolved
The issue has been resolved.
Posted Aug 15, 2021 - 23:13 AEST
Monitoring
A fix has been implemented. We're still monitoring.
Posted Aug 15, 2021 - 23:12 AEST
Investigating
Another major outage.
Posted Aug 15, 2021 - 23:09 AEST
Identified
The issue has been identified, and a fix is on its way.
Posted Aug 15, 2021 - 23:03 AEST
Update
We are continuing to investigate this issue.
Posted Aug 15, 2021 - 22:54 AEST
Update
We are continuing to investigate this issue.
Posted Aug 15, 2021 - 22:53 AEST
Investigating
We are currently investigating this issue.
Posted Aug 15, 2021 - 22:48 AEST
This incident affected: API and Apple Push Notifications.