The HPC sector is showing enterprise IT what the public cloud can do: save gobs and gobs of money, while pushing the envelope on the scale of computation. But how likely is widespread adoption?
For a glimpse of the future of computing, the high-performance computing sector has always been a good place to look. In this age of cloud computing, HPC shops have been true to form, seizing upon cheap and plentiful public cloud resources to build ginormous compute clusters for short dollars -- and big results. In recent years, the worldwide HPC sector has been one of the few bright spots in IT spending, estimated at $20.3 billion in 2011 and growing at a compound annual growth rate of 7.6%, according to IDC.
The availability of cheap cloud resources could propel the HPC sector still further, providing compute resources to a whole new class of users who were previously priced out, said Jason Stowe, CEO of Cycle Computing, a provider of HPC provisioning software. "Public cloud is democratizing access to utility supercomputing," Stowe said. "Before it was a have [or] have-not kind of environment." The trails blazed by HPC and public cloud could show enterprises with more modest needs how to take advantage of this new class of computing.
HPC in the cloud: The core of the matter
As with the TOP500, the yearly ranking of the world's fastest supercomputers, recent HPC cloud successes are nothing if not impressive. But whereas the TOP500 measures computing success in floating-point operations per second (FLOPS), cloud HPC projects tend to focus on cores.
Perhaps the most impressive project comes from computational chemistry firm Schrödinger, which created a whopping 50,000-core Amazon Elastic Compute Cloud (EC2) cluster to analyze 21 million drug compounds in just under three hours. The job -- a virtual screen of how well the compounds bind to a protein -- returned a list of more than 100 compounds that merited further testing.
That 50,000-core cluster offered many times the compute capacity of Schrödinger's internal resources: a 3,000-core cluster built from commodity 32-core nodes. "What we have available for a virtual screen is usually much less," which limits the scope of the research, said Rami Farid, Schrödinger's president. With smaller core counts, "you have to cut corners or it will not get done," he said.
In addition to the sheer size of the run, the Schrödinger job was notable for its low cost. Schrödinger paid AWS $4,900 per hour for three hours to run the job. Cycle Computing, which provided the cluster management software that enabled Schrödinger to export its jobs to the cloud, estimates the value of the EC2 infrastructure that ran the job at $20 million.
Other Amazon customers take advantage of spot instances to drive down the cost of running HPC jobs even further. Spot instances let users name the maximum price they are willing to pay for unused EC2 capacity; the job runs whenever that bid exceeds the current spot price, which fluctuates with real-time supply and demand.
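As a rough illustration, here is what such a request might look like using the boto3 AWS SDK for Python. The AMI ID, instance type, count and bid price are placeholder values, not figures from any of the projects described here, and in practice cluster-management tools handle this plumbing.

```python
# Minimal sketch: bidding for unused EC2 capacity with spot instances (boto3).
# The AMI ID, instance type and bid price are placeholders; a cluster manager
# normally issues these requests on the user's behalf.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Check the recent spot price for the instance type we want.
history = ec2.describe_spot_price_history(
    InstanceTypes=["m2.4xlarge"],          # high-memory type, as an example
    ProductDescriptions=["Linux/UNIX"],
    MaxResults=1,
)
print("Current spot price:", history["SpotPriceHistory"][0]["SpotPrice"])

# Ask for capacity at the maximum price we are willing to pay. The instances
# run only while the market price stays at or below this bid, so the job
# itself must tolerate interruption.
response = ec2.request_spot_instances(
    SpotPrice="0.35",                      # placeholder bid, in USD per hour
    InstanceCount=10,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-12345678",         # placeholder AMI
        "InstanceType": "m2.4xlarge",
    },
)
for req in response["SpotInstanceRequests"]:
    print("Spot request:", req["SpotInstanceRequestId"], req["State"])
```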
A recent example is the University of Wisconsin's Morgridge Institute for Research, which completed more than 1 million core hours of gene indexing in just over a week. The memory-intensive job required the use of relatively expensive high-memory EC2 instances. But because the job was architected to tolerate disruptions, researchers could specify the use of spot instances, which, at the time of the run, cost about one-twelfth as much as the equivalent on-demand instances. Ultimately, the institute paid just under $20,000 for the million-plus core hours of processing.
The parallel universe of HPC cloud workloads
Despite the enormous promise of HPC in the cloud, not every HPC job is a fit for the model -- far from it. Using the cloud for HPC workloads is still a new, albeit growing, phenomenon, and it is being used for a small subset of HPC jobs -- notably, the easiest ones, said Steve Conway, IDC research vice president for HPC.
By and large, HPC workloads in the cloud tend to be what the industry calls embarrassingly parallel: large jobs that can be easily split into smaller ones because the pieces have few or no dependencies between them. It is therefore possible to run those pieces simultaneously on commodity hardware nodes that are not connected by high-throughput, low-latency pipes -- the kind of low-end infrastructure that largely makes up the public cloud.
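The pattern is easy to sketch in a few lines of Python: each work item is scored on its own, nothing is shared between tasks, and results are gathered only at the end. The scoring function and compound list below are hypothetical stand-ins for a real screening code, purely for illustration.

```python
# Illustrative sketch of an embarrassingly parallel job: every task is
# independent, so the work can be spread across any number of cores or
# nodes with no communication between tasks while they run.
from multiprocessing import Pool

def score_compound(compound_id):
    """Score one compound against the target; depends on nothing else."""
    # ... an expensive, self-contained computation would go here ...
    return compound_id, sum(ord(c) for c in compound_id) % 1000  # dummy score

if __name__ == "__main__":
    compounds = [f"compound-{i}" for i in range(100_000)]
    with Pool() as pool:                     # one worker per local core
        results = pool.map(score_compound, compounds)
    # Only now are the independent results gathered and ranked.
    top_hits = sorted(results, key=lambda r: r[1], reverse=True)[:100]
    print(top_hits[:5])
```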
But most HPC jobs outside that subset require communication between tasks -- to report the results of an intermediate step, for example -- and thus are usually run on systems that feature high-speed interconnects, such as InfiniBand fabric, between the processing nodes.
Thus, it is largely true that the cloud is the best fit for jobs with a high degree of parallelism, conceded Cycle's Stowe, but "there's nothing embarrassing about it.... I like to think of them as pleasantly parallel," he said.
At the same time, cloud providers have responded by offering new compute resource types that are a better fit for a broader range of HPC jobs. Amazon offers Cluster Compute Eight Extra Large instances running on memory-rich Intel Sandy Bridge nodes. Those Cluster Compute instances can also be launched into Placement Groups, which provide low-latency, 10 Gbps connectivity between instances.
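A minimal sketch of that setup, again assuming the boto3 SDK, might look like the following; the AMI ID, group name and instance count are placeholders.

```python
# Minimal sketch: launching Cluster Compute instances into a placement
# group so they share a low-latency, high-bandwidth network segment.
# The AMI ID, group name and counts are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A "cluster" placement group asks EC2 to pack the instances close together.
ec2.create_placement_group(GroupName="hpc-demo", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-12345678",            # placeholder HPC-ready AMI
    InstanceType="cc2.8xlarge",        # Cluster Compute Eight Extra Large
    MinCount=8,
    MaxCount=8,
    Placement={"GroupName": "hpc-demo"},
)
```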
Likewise, Microsoft recently began offering so-called Big Compute nodes on its Azure public cloud. These come in two configurations: eight cores with 60 GB of RAM, or 16 cores with 120 GB. Both run on Intel Sandy Bridge nodes with DDR3 memory, five 1 TB disks, 10 GbE for network and storage communication, and 40 Gbps InfiniBand for internode communication.
If that still isn't enough, some traditional HPC vendors are also getting in on the cloud act, renting out access to optimized HPC hardware that provides better performance than typical public clouds. Examples include Penguin Computing's Penguin on Demand and Cyclone from SGI.
Part of SGI Cyclone's performance advantage comes from the fact that, unlike most public cloud offerings, it does not include a virtualization layer, said Franz Aman, SGI's chief marketing officer. "Having apps run natively on physical hardware gives you the performance you need," he said. Cyclone also provides Software as a Service (SaaS) access to specialized HPC software stacks such as NUMECA for computational fluid dynamics, or Gromacs for molecular dynamics.
These sorts of high-performance, integrated cloud offerings provide a stepping stone that can jumpstart a small company's use of HPC, or augment an organization's existing resources quite nicely.
Not all sunshine ahead for HPC in the cloud
While some organizations have had great success doing HPC in the cloud, others maintain that it still has a long way to go. In fact, many HPC cloud services have proved disappointing, said one technical project manager at a life sciences startup, who is not authorized to talk to the press. After conducting an extensive pilot last year, she concluded that most mainstream public clouds are not a good fit for HPC, and that most traditional HPC software stacks don't integrate well with cloud resources.
She was disappointed, for example, by EC2's small instance sizes and the lack of InfiniBand connectivity. Further, "EC2 is too much like a data center -- you need a sys admin with DevOps skills to get it to work, because scientists don't want to tinker with this stuff." And traditional HPC management software vendors have yet to nail cloud integration.
Internally, the firm uses cluster management software from Bright Computing, which she described as "very stable, robust and user friendly." However, an initial test of Bright Cluster Manager's cloud bursting feature, which is designed to let administrators launch new clusters in EC2 -- or burst existing clusters into it -- didn't go well. "It's not plug-and-play yet," she said.
Then there's the issue of data privacy, said Dr. John Meyers, assistant professor and technology director at Boston University School of Medicine. The school currently manages several hundred terabytes of data, mainly in support of next-generation genetics and genomics research, plus cellular imaging, but that data is crunched exclusively in-house. "We're leery of the cloud because a lot of our data is identifiable patient data" regulated by laws like HIPAA [the Health Insurance Portability and Accountability Act], said Meyers.
However, if and when the school does resolve its cloud privacy and security concerns, it is in a good position, he said, thanks to the school's use of storage virtualization software from Actifio that manages data copies for backup, disaster recovery, and test and dev purposes.
"If we do go down that road, the fact that Actifio is application-sensitive will make it easy for us to say 'Shoot this data off to the cloud, but keep this data in-house,'" he said. Other storage management platforms that operate at the lower block level don't make it as easy to segment data by type.
Moving data to and from the cloud is another frequently cited HPC challenge, said IDC's Conway. The time it takes to transfer data to the cloud depends on the size of the organization's Internet connection and, of course, the data set. On a low-end 1.5 Mbps Internet connection, simple math says a 1 GB file takes about an hour and a half to upload. But a 1 TB data set would take more than 60 days at that speed, and a more reasonable -- but still unwieldy -- 22 hours or so on a fast 100 Mbps connection. "Getting the data to the cloud for the first time can stop people cold," Conway noted.
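That back-of-the-envelope arithmetic is easy to reproduce; the short Python sketch below ignores protocol overhead and assumes the link is fully dedicated to the transfer.

```python
# Back-of-the-envelope upload times, ignoring protocol overhead and
# assuming the link is fully dedicated to the transfer.
def upload_seconds(size_gb, link_mbps):
    bits = size_gb * 8_000        # 1 GB ~= 8,000 megabits
    return bits / link_mbps

for size_gb, link_mbps in [(1, 1.5), (1_000, 1.5), (1_000, 100)]:
    secs = upload_seconds(size_gb, link_mbps)
    print(f"{size_gb:>5} GB at {link_mbps:>5} Mbps: "
          f"{secs / 3600:8.1f} hours ({secs / 86400:5.1f} days)")

# 1 GB at 1.5 Mbps  -> about 1.5 hours
# 1 TB at 1.5 Mbps  -> roughly 62 days
# 1 TB at 100 Mbps  -> about 22 hours
```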
Some specialized cloud providers offer data upload services, and it is possible to speed up data transfers by sending files in multiple streams and by using compression. And of course, some organizations have their data co-located at or near the cloud provider's site, rendering these issues moot. However, for everyone else, it is imperative to think about the time it will take to get data to (and from) the cloud before jumping in. The waters may be too cold for some people's taste, but for a few hardy souls, it's just fine.