Microsoft goes public with more of its Azure capacity improvement plans

Microsoft officials have been providing updates on how the company has been working to increase cloud capacity since the start of the global COVID-19 pandemic. On June 16, Microsoft shared more specifics about what it has been doing on this front, including some information on how it has been endeavoring to shore up its Azure-based Teams service as demand grew sharply starting this spring.

Officials had already talked about Microsoft’s prioritization of demand from first responders, healthcare workers, and other front line workers. They had shared details about some of the less-essential services they throttled. And they also had publicly acknowledged that supply chain challenges led to a shortage of some needed datacenter components, further contributing to issues meeting some cloud demands.

Today, officials said Microsoft datacenter employees have been working round-the-clock shifts to install new server racks (while staying at least six feet apart), with new capacity going first to the hardest-hit regions.

They also said Microsoft doubled capacity on one of its own undersea cables that carries data across the Atlantic, and “negotiated with owners of another to open up additional capacity.” Engineers tripled deployed capacity on the America Europe Connect cable in two weeks, they added.

At the same time, product teams looked across all of Microsoft’s services running on Azure to free up more capacity for high-demand services like Teams, Office, Windows Virtual Desktop, Azure Active Directory’s Application Proxy, and Xbox, officials said. And in some cases, engineers rewrote code to improve efficiencies, as they did with video-stream processing, which officials said they made 10 times more efficient over a weekend-long push.

Teams spread its reserved capacity across additional datacenter regions within a week, a process that would normally take multiple months, officials said. In addition, Microsoft’s Azure Wide Area Network team added 110 terabits of capacity in two months to the fiberoptic network that carries Microsoft data, along with 12 new edge-computing sites to connect the network to infrastructure owned by local Internet providers to help reduce network congestion.

Microsoft also moved its own internal Azure workloads to avoid demand peaks worldwide and to divert traffic from regions experiencing high demand, officials said. On the consumer side, Microsoft also moved gaming workloads out of high-demand data centers in the UK and Asia and worked to decrease bandwidth usage during peak times of the day.

Microsoft also had to update its forecasting models to take into account the major uptick in cloud demand resulting from the pandemic. Microsoft added to its multiple predictive modeling techniques (ARIMA, Additive, Multiplicative, Logarithmic) some basic per-country caps to avoid over-forecasting. It also tuned its models to take into account inflection and growth patterns by usage per industry and geographic area, while adding in external data sources about COVID-19’s impact by country.
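The blog post doesn’t show the actual models, but the idea of blending several forecasting techniques and then clamping the result with a per-country cap can be sketched roughly like this. Everything here — the toy additive and multiplicative trend models, the cap policy, and the sample numbers — is an illustrative assumption, not Microsoft’s production code:

```python
# Illustrative sketch: blend simple trend forecasts, then apply a
# per-country cap so a runaway growth curve can't over-forecast demand.
# Models and numbers are assumptions, not Microsoft's actual pipeline.

def additive_forecast(history, horizon=1):
    """Project forward using the average recent increment (additive trend)."""
    deltas = [b - a for a, b in zip(history, history[1:])]
    step = sum(deltas) / len(deltas)
    return history[-1] + step * horizon

def multiplicative_forecast(history, horizon=1):
    """Project forward using the average recent growth ratio."""
    ratios = [b / a for a, b in zip(history, history[1:])]
    growth = sum(ratios) / len(ratios)
    return history[-1] * (growth ** horizon)

def capped_ensemble(history, country_cap):
    """Average the model outputs, then clamp to the per-country cap."""
    estimates = [additive_forecast(history), multiplicative_forecast(history)]
    blended = sum(estimates) / len(estimates)
    return min(blended, country_cap)

usage = [100.0, 130.0, 170.0, 220.0]  # e.g. daily core-hours in one country
print(capped_ensemble(usage, country_cap=400.0))  # cap doesn't bind
print(capped_ensemble(usage, country_cap=250.0))  # cap binds at 250.0
```

A real deployment would use fitted ARIMA models and external COVID-19 signals rather than hand-rolled trend lines, but the cap acts the same way: it is a cheap backstop against any single model extrapolating too aggressively.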

“Throughout the process, we erred on the side of caution and favored over-provisioning, but as the usage patterns stabilized, we also scaled back as necessary,” officials said.

Microsoft also learned some lessons by scaling out its compute resources for Teams. By redeploying some of its microservices to favor a larger number of smaller compute clusters, the company was able to avoid some per-cluster scaling considerations, speed up its deployments and obtain more fine-grained load-balancing, officials said. Microsoft also got more flexible in terms of the type of VMs or CPUs used to run different microservices so it could focus on overall compute power or memory to increase Azure resource use in each region. And engineers were able to optimize the service code itself in ways to reduce things like CPU time spent generating avatars.
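The payoff of many smaller clusters over a few large ones can be seen in a toy load-balancing simulation. This sketch uses a standard greedy least-loaded placement strategy; the job sizes and cluster counts are illustrative assumptions, and Microsoft’s actual scheduler is certainly more sophisticated:

```python
import heapq

def assign_least_loaded(loads, num_clusters):
    """Greedily place each job on the currently least-loaded cluster,
    a common baseline strategy for load balancing."""
    heap = [(0.0, i) for i in range(num_clusters)]
    heapq.heapify(heap)
    totals = [0.0] * num_clusters
    for load in loads:
        current, idx = heapq.heappop(heap)
        totals[idx] = current + load
        heapq.heappush(heap, (totals[idx], idx))
    return totals

jobs = [5, 9, 2, 7, 4, 8, 3, 6, 1, 5]       # illustrative workload units
few_big = assign_least_loaded(jobs, 2)      # two large clusters
many_small = assign_least_loaded(jobs, 5)   # five smaller clusters
print(few_big, many_small)
# With more, smaller clusters, each placement moves less load at once,
# so balancing decisions are finer-grained per unit of capacity.
```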

Microsoft added new routing strategies to leverage idle capacity. Calling and meeting traffic was routed across multiple regions to handle surges, and time-of-day load balancing helped Microsoft avoid wide-area network throttling, officials said. Using Azure Front Door, Microsoft was able to route traffic at a country level. And it made a number of cache and storage improvements, which ultimately helped achieve a 65% reduction in payload size, a 40% reduction in deserialization time, and a 20% reduction in serialization time.
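Country-level routing with spillover to idle capacity can be sketched as a simple lookup with failover. To be clear, the region names, the routing table, and the overload signal below are all illustrative assumptions in the spirit of what Azure Front Door enables, not Microsoft’s actual configuration:

```python
# Illustrative country-level routing with failover to a secondary region.
# Region names and the routing table are assumptions for the sketch.

PRIMARY = {"GB": "uk-south", "FR": "france-central", "US": "east-us2"}
SECONDARY = {"uk-south": "uk-west",
             "france-central": "west-europe",
             "east-us2": "central-us"}

def route(country, overloaded_regions):
    """Pick a country's primary region, spilling over to its secondary
    when the primary is flagged as overloaded."""
    primary = PRIMARY.get(country, "west-europe")  # assumed default region
    if primary in overloaded_regions:
        return SECONDARY.get(primary, primary)
    return primary

print(route("GB", set()))          # normal: traffic stays in uk-south
print(route("GB", {"uk-south"}))   # surge: spill over to uk-west
```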

Microsoft also tweaked its incident-management policies. It switched out incident management rotations from a weekly to a daily cadence. It brought in more incident managers from across the company and deferred all non-critical changes across its services.

All of this cloud-capacity scaling will have an impact on how Microsoft builds and maintains its Azure-based services, like Teams, officials said.

“What we can do today by simply changing configuration files could previously have required purchasing new equipment or even new buildings,” said officials in today’s blog post.

On the Teams’ futures front, Microsoft plans to transition from VM-based deployments to container-based deployments using its Azure Kubernetes Service. Officials said they expect to minimize the use of REST in favor of more efficient binary protocols like gRPC. 
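The payload savings behind that move from REST to binary protocols are easy to demonstrate. The sketch below uses Python’s struct module as a stand-in for gRPC’s protobuf encoding; the message fields and fixed binary layout are illustrative assumptions:

```python
import json
import struct

# Why binary protocols shrink payloads versus JSON-over-REST:
# the same message, encoded both ways. Field names and layout are
# assumptions for illustration, not a real Teams wire format.

message = {"user_id": 42, "channel_id": 7, "typing": True}

# JSON repeats every field name as text and encodes numbers as digits.
json_bytes = json.dumps(message).encode("utf-8")

# A fixed binary layout: two unsigned 32-bit ints plus one boolean byte.
binary_bytes = struct.pack("<IIB", message["user_id"],
                           message["channel_id"], message["typing"])

print(len(json_bytes), len(binary_bytes))  # binary is several times smaller
```

Real protobuf messages also carry small field tags and support optional fields, so they are not quite this compact, but the field names and textual numbers that JSON repeats on every request are exactly the overhead a binary protocol eliminates.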

Reports of existing Microsoft customers hitting Azure capacity limits in certain regions didn’t just start when more people began working from home during the COVID-19 pandemic. Last fall, a number of East US2 Azure customers were reporting they couldn’t even spin up virtual machines because of Azure capacity issues. I’ll be interested to see if these new capacity improvements will head off new reports of general Azure capacity issues as the pandemic runs its course.