# incidents

---

# goals

* fast MTTD (mean time to detection)
* high MTTF (mean time to failure) (uptime)
* high MTBF (mean time between failures)
* low MTTR (mean time to recovery)

---

# goals

* less panic
* fewer people doing things all at once
* more coordination
* more (and clearer) communication

---

# roles

```
        --------------------
        |Incident Commander|
        --------------------
           /          \
          /            \
   [Tech Lead]   [Communications Lead]
      /    \
other engineers
```

---

# incident commander

* does not fix the problem (but has some knowledge of the system)
* keeps up to date with everything that's going on
* handles internal (engineering) communication

---

# tech lead

* fixes the problem
* keeps the incident commander updated
* runs major changes by the IC

---

# communications lead

* external communications (link to marketing, CxOs, etc.)
* keeps the IC updated with what stakeholders/customers are saying

---

# severity

if everything is a P0, nothing is a P0

---

# severity

* **P0**
  * _everything is on fire oh god_
  * maybe multiple engineers under the tech lead
  * communications lead necessary
* **P1**
  * _one thing is on fire_
  * maybe just the tech lead on engineering
  * communications lead necessary if customers/stakeholders are affected
* **P2**
  * _something looks like it might catch fire soon_
  * probably no incident commander or communications lead
* **P3**
  * _look at this sometime next sprint_
  * definitely not an incident

---

# how can we do this?

---

# games

* chaos engineering
  * chaos monkey
  * manually just killing stuff sometimes
* bad stg releases
* bad prd releases

---

# tools

* victorops (on call should matter!)
* newrelic (we pay them a lot of money, let's use them better!)
* logging (sumologic -> efk)
* chatops (we all live in slack anyway)
* documentation (how should incident reports work? runbooks?)

---

# restrictions

if there's only one process for making changes in production, no one has to wonder if they're doing it right.
* no access to boxes in prod
* no changes through the AWS console or CLI
* everything goes through CI
  * services
  * client applications
  * db
  * infrastructure (terraform, kubernetes)
* no cheating
  * if the tests don't pass, it doesn't go out
  * having 0 tests and a green build doesn't count

---

# training

* on the tools
* on the process
* on other teams (cross-training — limit the bus factor)

---

# fewer incidents

* better autoscaling
* better CI/CD process
* move pieces to different layers (example: rate limiting close to the edge)
* circuit breakers

---

# fewer "incidents"

* if it will self-correct (scaling)
* if it has nothing to do with us and we can't fix it (someone on IE 8)
* if it's an exception we can just expect to happen sometimes

### _then it's not an incident_
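
---

# appendix: measuring the goals

The metrics from the goals slide can be computed from incident records. A minimal sketch, assuming each incident is stored as hypothetical `(started, detected, resolved)` timestamps (this record shape is an illustration, not something the deck prescribes):

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (started, detected, resolved).
incidents = [
    (datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 5), datetime(2023, 1, 1, 11, 0)),
    (datetime(2023, 1, 3, 9, 0),  datetime(2023, 1, 3, 9, 15), datetime(2023, 1, 3, 9, 45)),
]

def mean(deltas):
    return sum(deltas, timedelta()) / len(deltas)

# MTTD: how long it takes us to notice an incident has started.
mttd = mean([detected - started for started, detected, _ in incidents])

# MTTR: how long from the start of an incident until service is restored.
mttr = mean([resolved - started for started, _, resolved in incidents])

# MTBF: average gap between the starts of consecutive incidents.
gaps = [b[0] - a[0] for a, b in zip(incidents, incidents[1:])]
mtbf = mean(gaps)

print(mttd, mttr, mtbf)
```

The goals then read directly off these numbers: drive MTTD and MTTR down, drive MTBF (and MTTF) up.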
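
---

# appendix: circuit breaker sketch

The circuit breaker from the fewer-incidents slide can be sketched as a tiny state machine: after enough consecutive failures, stop calling the dependency and fail fast until a timeout passes. A minimal sketch — class name, thresholds, and exception choice are all illustrative, not from the deck:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency so it gets room to recover."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures   # consecutive failures before the circuit opens
        self.reset_after = reset_after     # seconds to wait before trying again
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # any success closes the circuit
        return result
```

Wrapping calls to a flaky downstream service this way means that, once it trips, callers get an immediate error instead of piling up behind timeouts — which is exactly the kind of cascading failure that turns one incident into several.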