Microsoft's latest incident affected Azure, Teams, and Outlook for hours.
Microsoft recently released their postmortem. I applaud Microsoft for publishing this. Could your company have done such a thorough job so quickly?
As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them.
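To make that failure mode concrete, here is a toy sketch of how one flooded update can touch every router at once. It is not Microsoft's topology or routing protocol: the five-router `wan` map and the helper names `flood_update` and `can_forward` are hypothetical, and the model assumes the originating router recomputes along with everyone else.

```python
from collections import deque


def flood_update(adjacency, origin):
    """Return every router that receives an update flooded from origin."""
    reached = {origin}  # assumption: the originating router recomputes too
    queue = deque([origin])
    while queue:
        router = queue.popleft()
        for neighbor in adjacency[router]:
            if neighbor not in reached:
                reached.add(neighbor)
                queue.append(neighbor)
    return reached


def can_forward(path, recomputing):
    """A packet gets through only if no router on its path is recomputing."""
    return not any(router in recomputing for router in path)


# Hypothetical five-router WAN, listed as an adjacency map.
wan = {
    "r1": ["r2", "r3"],
    "r2": ["r1", "r4"],
    "r3": ["r1", "r4"],
    "r4": ["r2", "r3", "r5"],
    "r5": ["r4"],
}

recomputing = flood_update(wan, "r1")
print(sorted(recomputing))
print(can_forward(["r2", "r4", "r5"], recomputing))
```

In a connected network the flood reaches every router, so in this simplified model no path is usable until the tables settle.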
Maybe your network isn't large enough for this "re-computation process" to saturate your network equipment.
Regardless, there is a lesson here.
Due to the WAN impact, our automated systems for maintaining the health of the WAN were paused, including the systems for identifying and removing unhealthy devices, and the traffic engineering system for optimizing the flow of data across the network.
Their network management system, including device security, ran ACROSS their network. So when the network was impacted, the network management system was ineffective too. Basically, Microsoft had to watch and wait for the network to settle down.
A side-channel (out-of-band) network management solution would have mitigated that. It would also have introduced a myriad of other problems, principally security.
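The trade-off can be sketched with a small reachability check. Everything here is hypothetical: the device names (`nms`, `oob`, `r1` to `r3`), the helpers `build_links` and `reachable`, and the simplifying assumption that a busy router drops transit traffic but still accepts a management connection from a directly attached neighbor.

```python
def build_links(pairs):
    """Build an undirected adjacency map from (a, b) link pairs."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, source, target, busy=frozenset()):
    """Search for target, never transiting a device that is busy recomputing."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbor in links.get(node, ()):
            if neighbor in seen:
                continue
            # A busy router cannot relay traffic, but it can still be the endpoint.
            if neighbor in busy and neighbor != target:
                continue
            seen.add(neighbor)
            stack.append(neighbor)
    return False


wan_links = [("nms", "r1"), ("r1", "r2"), ("r2", "r3")]
oob_links = [("nms", "oob"), ("oob", "r1"), ("oob", "r2"), ("oob", "r3")]

in_band = build_links(wan_links)
with_oob = build_links(wan_links + oob_links)
busy = {"r1", "r2", "r3"}  # every WAN router is recomputing its tables

print(reachable(in_band, "nms", "r3", busy))
print(reachable(with_oob, "nms", "r3", busy))
```

With only the in-band path, the management station cannot reach `r3` while the WAN is unsettled; the separate `oob` path restores access. That same path is also a second way into every device, which is where the security cost comes from.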