Microsoft CEO Satya Nadella and Cloud and Enterprise EVP Scott Guthrie used an event in San Francisco today to give an update on the company's Azure cloud platform, launching its on-premises Cloud Platform System and the Azure Marketplace, where ISVs and developers can sell services and applications to Azure users.

Noting that Microsoft's cloud is part of his productivity vision for the company, Nadella said that "Productivity and platforms, that's our core, the soul of our company [...] The Microsoft cloud is the most complete cloud offering: used by businesses across every industry in every geography." With a revenue run rate of $4.4 billion, and with 80% of the Fortune 500 as customers, Microsoft's cloud is starting to justify the $4.5 billion of CAPEX.

Interestingly, Nadella also pointed out two key figures. First, "20% of Azure is running Linux, there will be first class support for Linux on Azure." Second, "Over 40% of Azure revenue from ISVs and startups, it's the best road to the cloud for startups." With Microsoft launching Azure Marketplace for software that runs in and on Azure, with support for any OS, any service, and any application, it's clear that Microsoft intends to build on a growing ecosystem, giving startups a new outlet for their products and services.

Guthrie followed up by unveiling new high capacity server images for working with large amounts of data, and a new high capacity storage tier, giving Azure users the option to scale up as well as scale out. However, the key announcement was the company finally delivering on its Azure-in-a-box promise, with the launch of the on-premises Cloud Platform System.

Microsoft has been running hyper-scale cloud services for many years now, with at least five generations of hardware design under its belt. But the hardware is only part of the story; the rest is software: the hypervisor, the management tools, and the automation. The result is everything Microsoft needs to build and run a software defined data center at scale. Now it's time for the tools and technologies that run Microsoft's cloud to come into your data center.

While passing through Redmond a week or so ago, we had the opportunity to spend some time with the team responsible for delivering Microsoft's Cloud Platform System (previously codenamed "San Diego") and to see it in action.

Microsoft has long promised to deliver "Azure in a box". We've seen elements of that promise in hardware from HP and Dell, in the private cloud and orchestration tools in the latest System Center releases, and in the free Windows Azure Pack. While all the elements have been available, they've never been wrapped up into one product with one point of contact for support. That's what the Cloud Platform System offers: a complete software defined data center in a box, or at least in a rack, with everything you need to run a private cloud.

According to Vijay Tewari, Principal Program Manager for CPS, Microsoft's customers are saying "I want my cloud and I want it now", but almost 80% of the resulting private cloud projects are failing. That's because customers were looking for what Tewari called a "magic cloud", where everything just happens. They weren't expecting all the work that's actually needed to run a cloud.

There's a question of what a private cloud should be, and what it should deliver. Answering it has meant working with Microsoft's public cloud services, Bing, Hotmail, and Azure, to learn just what's needed to build and run a cloud service. The CPS team also considered the trends that were driving businesses to use their own cloud services. At the heart of things was the need to offer consumer grade services, as their users were familiar with services like Facebook and Gmail, and were expecting the tools they used at work to work the same way. But existing processes and technologies got in the way of IT departments, blocking them from responding in an agile way.

Users are expecting the same level of service they get from the public cloud, where it is headline news when Gmail goes down. They want self-service portals, which Tewari described as somewhere "I should be able to walk up to it to tell it to do things in the way I want them." Sure, it's instant gratification, but that's what users want.

There's also a deeper set of requirements, where the move away from the traditional separation of developers and operations means that developers want programmatic access to infrastructure with APIs to help them handle scaling issues or server deployment.

Putting all that together means that Microsoft needs to deliver a system that offers what Tewari calls "a validated system architecture, from top to bottom; from the individual disks we use in storage, all the way up to the portal we expose to the customer." It's a very different way of working from Microsoft's traditional customer relationship, as it means delivering a lot more than software, with hardware that has a supply chain behind it and that's aware of failures, while still able to continue operating. After all, if there's one thing we know about IT, it's that things are always going to fail, whether hardware or software.

As Tewari says, "Humans get in the way, they're the least predictable thing in any system. You need to automate your way out, don't depend on a human reading a set of instructions." He describes the result, a resilient system that's deployed in a known boot state and configured in accordance with best practices, as one where "Any time you touch the system, you do it through automation." The aim is to keep the system in the state in which it was initially booted.

He walked us through how such a CPS-based private cloud would handle updates. "We provide a validated set of updates through a framework that orchestrates up from firmware BIOS and NIC BIOS, all the way up the stack. We start by applying patches to a Hyper-V host - we drain it, patch it up, and rehydrate all back when patching completed." That process works for the entire stack, in an orchestrated fashion. By stack, Tewari means everything from hardware to workloads. Microsoft has validated key business applications, like Exchange, Lync, and SharePoint, and given them best practice automation.
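
The drain, patch, rehydrate cycle Tewari describes can be sketched as a rolling update loop. This is a minimal illustration, not Microsoft's orchestration code: the `Host` class and the `drain`, `rehydrate`, and `rolling_patch` functions are hypothetical names, and the sketch assumes a simple policy of moving each VM to the least-loaded remaining host.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for a Hyper-V host in a cluster."""
    name: str
    vms: list = field(default_factory=list)
    patch_level: int = 0


def drain(host, others):
    """Move every VM off `host` onto the least-loaded other host."""
    moved = []
    while host.vms:
        vm = host.vms.pop()
        target = min(others, key=lambda h: len(h.vms))
        target.vms.append(vm)
        moved.append((vm, target))
    return moved


def rehydrate(host, moved):
    """Bring the drained VMs back to the host they came from."""
    for vm, target in moved:
        target.vms.remove(vm)
        host.vms.append(vm)


def rolling_patch(hosts, new_level):
    """Patch one host at a time so workloads keep running elsewhere.

    Needs at least two hosts, so there is somewhere to drain to.
    """
    for host in hosts:
        others = [h for h in hosts if h is not host]
        moved = drain(host, others)    # host is now empty
        host.patch_level = new_level   # stand-in for applying the update set
        rehydrate(host, moved)         # restore the original placement


hosts = [Host("node1", ["vm-a", "vm-b"]), Host("node2", ["vm-c"])]
rolling_patch(hosts, new_level=1)
print([(h.name, sorted(h.vms), h.patch_level) for h in hosts])
```

Only one host is ever out of service, and every VM ends up back where it started, which is the property the "known state" approach depends on.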

A cloud needs one point of contact, and so Microsoft is offering CPS with unified support. Microsoft is the place you'll make the first call, and it will then take responsibility for ensuring that your support call is handled by the appropriate experts. You will need separate support for elements that aren't part of the CPS, so if you're running Linux client VMs you'll need to work with your vendor. However, a consistent platform will make it easier for them to provide appropriate support.

So what goes into the Microsoft Cloud Platform System? Tewari describes it as "an Azure-consistent cloud in a box, where the box is fairly big." You're not getting the complete Azure set of services in CPS, but you are getting the tools and features that let you deliver much of its infrastructure as a service, along with some platform elements. Microsoft has worked with Dell as its key hardware partner, for blade servers, for storage modules, and for networking. CPS's software comes from familiar tools: Windows Server 2012 R2 and the Hyper-V hypervisor, System Center 2012 R2, and the Windows Azure Pack. As Tewari notes, "There's no special software sauce - the way we orchestrate is where you get value."

Talking about delivering the CPS, Tewari is expansive: "We've been working on this for the last 18 months, and it's taken a lot of pain to make sure we get it right. The key learning for the product team has been that the cloud is not about individual components, it's about the entire stack working for the customer." He describes how the team worked, in one "ship room", telling us, "I have never seen a ship room like this, with all kinds of participation, from core OS folks to the network team to the storage team, to clustering, to the System Center team, and the Azure Pack team. We were getting all these people to collaborate and fix things for the customer, a huge benefit for us." There was also work with product teams, including SQL Server and Exchange, building support for automated deployment.

So what do you get with the CPS? The key component is its portal, built around the Windows Azure Pack Portal. That's where the Azure-consistent UI comes in, with administrator and tenant portals. Administrators can just deploy services for users, or they can construct cellphone-like plans that can be used to manage services for different business units and classes of user.
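
As a rough illustration of what a "cellphone-like plan" could look like, the sketch below models a plan as a bundle of resource limits that a business unit subscribes to. It is not the Windows Azure Pack data model: the `Plan` and `Subscription` classes and their fields are hypothetical, chosen only to show the idea of quota-based self-service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan: a named bundle of resource limits."""
    name: str
    max_vms: int
    max_cores: int


@dataclass
class Subscription:
    """One business unit's subscription to a plan, with current usage."""
    owner: str
    plan: Plan
    vms: int = 0
    cores: int = 0

    def can_deploy(self, vm_cores):
        """True if one more VM of this size fits inside the plan's limits."""
        return (self.vms + 1 <= self.plan.max_vms
                and self.cores + vm_cores <= self.plan.max_cores)

    def deploy(self, vm_cores):
        """Record a new VM, refusing it if the plan's quota would be exceeded."""
        if not self.can_deploy(vm_cores):
            raise RuntimeError(
                f"{self.owner}: quota exceeded for plan {self.plan.name}")
        self.vms += 1
        self.cores += vm_cores


basic = Plan("basic", max_vms=2, max_cores=4)
finance = Subscription("finance", basic)
finance.deploy(vm_cores=2)
finance.deploy(vm_cores=2)
print(finance.can_deploy(vm_cores=1))  # False: the two-VM limit is reached
```

The administrator defines the limits once; tenants then serve themselves within them, without a ticket to IT for each deployment.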

Mark Jewett, CPS's Director of Product Marketing, notes that what's different from other software defined data center platforms is, "We're not focused on virtualization, we're focused on delivering a cloud way of working. [...] It's really about that cloud model, where you provide end users with an environment that they feel have control."

CPS is clearly a complete software defined data center solution: all the infrastructure is deployed as virtual machines, running on bare metal hypervisors, with virtual networking. The only physical element is the storage stack. There's no need to touch the physical networking, as everything is handled via the Hyper-V virtual networking tools, allowing users' virtual infrastructures to be isolated from each other.

So what goes into the CPS? It's a hefty beast, with the minimum unit a single rack of compute, networking and storage units all packed in tightly. There's plenty of capacity: with a full four-rack setup you can run up to 8000 virtual machines and work with 0.7PB of storage space (using Microsoft's SMB 3.0-based Scale-out File Server). It's a fault-tolerant system at every level, from basic hardware up. Once in place there's nothing you need to do: it's all preconfigured and predeployed, ready to go.

Each CPS unit is based on a standard 42U rack, with 512 cores, 8TB of RAM, and 262TB of usable storage. There's 160GB/s of internal connectivity, with 20GB/s of external networking. It's not light, weighing 2322lbs, and you're going to need 16.6kW of power. Storage is in 4 storage servers, with 4 JBODs, and the compute servers are Dell PowerEdges with dual socket IvyBridge-class processors.

Tewari notes that this is all based on what Microsoft runs internally. "We're extreme dogfooding," he says as he describes Microsoft's Nebula, the codename for its internal cloud. Inside Microsoft the Nebula cloud runs over 90 thousand VMs, and it sets up and tears down 20 thousand VMs a day.

CPS builds on existing Microsoft clustering tools, with a rack offering 32 separate compute nodes, 24 of which can be used for your own workloads. Of the rest, 6 nodes are used to handle management, running the System Center tools for virtual machine, configuration and cluster management. There's support for Azure Site Recovery, so you can use the public cloud to orchestrate disaster recovery using the runbooks that automate CPS operations.

There's no need to work with the low level software in CPS. Like any cloud, it's best thought of as an appliance (a very large appliance that weighs as much as a herd of small elephants, but an appliance all the same). Everything is delivered through portals, even the information needed to handle operational management. Users get access to a gallery of workloads, which deploy using System Center orchestration runbooks to ensure as much of CPS operations as possible are automated.

Microsoft is aiming to make operating CPS as easy as possible, and its test program has certainly put the system through its paces, simulating a year's worth of damage in a week in order to stress test the hardware and software. In order to do so the product team automated heavy workloads, with test teams behind the racks physically removing power supplies. The resulting stress exposed problems all the way down to drive firmware level.

It's an approach that Spencer Shepler, Principal Program Manager, sums up as "We're raising all boats. All the bugs we have found go to all System Center and Windows Server customers, as all fixes are public." Similarly, the CPS team tests updates and gallery items before making them available. "We create a package, validate and then test it on Nebula before deploying it to customers." That's not just the OS, or the applications; it's everything in the deployment pipeline, right down to drive firmware.

The complete specs for a single CPS rack are:

  • Networking: 4 x Force 10 S4810P and 1 x Force 10 S55
  • Compute Scale Unit (32 x Hyper-V hosts): Dell PowerEdge C6220ii, 4 nodes per 2U, with dual socket Intel IvyBridge (E5-2650v2 @ 2.6GHz), 256 GB memory, 2 x 10 GbE Mellanox NICs (LBFO Team, NVGRE offload), 2 x 10 GbE Chelsio (iWARP/RDMA), 1 local SSD 200 GB (boot/paging)
  • Storage Scale Unit (4 x file servers, 4 x JBODs): Dell PowerEdge R620v2 servers (4 servers for Scale-Out File Server), dual socket Intel IvyBridge (E5-2650v2 @ 2.6GHz), 2 x LSI 9207-8E SAS controllers (shared storage), 2 x 10 GbE Chelsio T520 (iWARP/RDMA), PowerVault MD3060e JBODs (48 HDD, 12 SSD), 4 TB HDDs and 800 GB SSDs
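
The spec list can be cross-checked against the headline per-rack figures with some simple arithmetic. The sketch below does that in Python. One input is not in the article and is an assumption on our part: that the Xeon E5-2650 v2 is an 8-core processor. The storage numbers it prints are raw capacity, before any mirroring, so they will be larger than the usable figure Microsoft quotes.

```python
# Figures taken from the spec list above.
HOSTS_PER_RACK = 32
SOCKETS_PER_HOST = 2
MEMORY_PER_HOST_GB = 256
JBODS_PER_RACK = 4
HDDS_PER_JBOD, HDD_TB = 48, 4
SSDS_PER_JBOD, SSD_GB = 12, 800

# Assumption (not stated in the article): the E5-2650 v2 has 8 cores.
CORES_PER_SOCKET = 8

cores = HOSTS_PER_RACK * SOCKETS_PER_HOST * CORES_PER_SOCKET
memory_tb = HOSTS_PER_RACK * MEMORY_PER_HOST_GB / 1024
raw_hdd_tb = JBODS_PER_RACK * HDDS_PER_JBOD * HDD_TB
raw_ssd_tb = JBODS_PER_RACK * SSDS_PER_JBOD * SSD_GB / 1000

print(f"cores per rack:   {cores}")            # 512
print(f"memory per rack:  {memory_tb:.0f} TB")  # 8 TB
print(f"raw HDD capacity: {raw_hdd_tb} TB")     # 768 TB
print(f"raw SSD capacity: {raw_ssd_tb:.1f} TB") # 38.4 TB
```

Under that assumption the core and memory totals match the 512 cores and 8TB of RAM quoted for a single rack.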

It's a dense unit, and can be deployed in groups of four racks. Each group is a separate administrative unit, with storage mixed between triple-mirrored workload storage and double-mirrored backup using deduplicated storage. Set up time, from delivery, should be about four days, with a white glove service from both Microsoft and Dell. Aside from Microsoft's Nebula cloud, four customers are using CPS in a private preview at the moment. General availability will be at the beginning of November.
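
Mirroring is the main reason usable storage is so much smaller than raw capacity: a double mirror keeps two copies of every block and a triple mirror keeps three. The sketch below shows the basic relationship. The `usable_tb` helper and the 300TB input are purely illustrative, and the calculation ignores deduplication, metadata, and spare capacity. The article does not say how CPS splits its disks between the two layouts, so this is not an attempt to reproduce Microsoft's usable figure.

```python
def usable_tb(raw_tb, copies):
    """Usable capacity when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_tb / copies


RAW_TB = 300  # illustrative pool size, not a CPS figure

for label, copies in (("double mirror", 2), ("triple mirror", 3)):
    print(f"{label}: {usable_tb(RAW_TB, copies):.0f} TB usable from {RAW_TB} TB raw")
```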

Shepler notes that the "Economics [of CPS] beat much of what is out there. It's going to be at a different fundamental price point, as we have the advantage of scale purchasing." Microsoft initially planned on using a single server design across CPS, but ended up using two to get the best possible performance from the storage network elements. "We were looking at industry standard, but reducing the number of different components for field replacement. Our approach allows users to stock replacements on premises and in the OEM supply chain."

In his presentation, Guthrie reiterated Microsoft's view that there will only be three at-scale cloud platform vendors: Microsoft, Amazon, and Google. With today's announcements it's clear that Microsoft intends to be the largest of those three, encompassing all aspects of the public and private clouds. It's a big ambition; now Redmond needs to deliver.