The system has failed / Lessons from the backend

Full Featured (30 min.)

At Soluto, we thought our distributed system was fairly error-proof, or at least safe from a major catastrophe. After all, we applied every best practice: microservices, auto-scaling, automated monitoring, and feature flags. We assumed that if one service failed, the others would not be affected. But we were wrong: I accidentally shut down Soluto's backend with one simple feature for our users. Join me to hear how it happened and, more importantly, about the pitfalls and the possible solutions that could have prevented the incident or reduced the damage, such as back-pressure patterns, a responsible feature release process, and disaster recovery. In short, things you can do to improve your backend's resiliency.