Microservice architectures result in environments up to 20 times larger than their monolithic counterparts. In such large and interconnected environments, container metrics will tell you about infrastructure health but not service health. Even if you have implemented service health checks to react quickly to service failures, in a resilient system (such as one built on top of Mesos/Marathon or DC/OS) you will see transient mushroom cloud effects in which a large number of services are temporarily affected. The mushroom cloud shows you all services, containers, and hosts affected by a failing component. How do you find out what really caused the problem, and how do you distinguish cause from effect?
In this session, Alois will perform a post-mortem analysis, walking through different failure cases we've observed in a real-world, large e-commerce production environment running on Apache Mesos, and show you how to figure out what actually caused the failures.