Data scientists and business bigwigs discussed the thorny logistics and ethics of big data analytics last week at MIT, and whether this type of data belongs in the public cloud.
When a panel of two data scientists and a business expert here was asked about cloud computing's role in the big data trend, some panelists expressed skepticism about its trustworthiness.
"We will continue to invest in internal infrastructure," said Claudia Perlich, chief scientist for media6degrees, a marketing research firm based in New York. "Our customers don't necessarily trust cloud providers, and we like as much control as we can get."
She said it is in the company's best interest to invest in its own infrastructure, even though it may cost more than using a public cloud service. A company that's less focused on data processing at its core, which just uses big data for decision support, might feel differently, she conceded.
But more and more businesses have begun to run big data operations in the cloud, according to MIT conference attendee Mike Olson, CEO of Cloudera, an Apache Hadoop-based software and services business in Palo Alto, Calif. They do so not necessarily because it's cheaper -- in some cases it's not -- but because of the flexibility of scale, he said.
In fact, there are a lot of startups in the big data space that were born on the public cloud and find it a better way to keep infrastructure and management costs low, said panelist Tom Davenport, a professor at Harvard Business School and Babson College.
This is also why open source tools are currently so prevalent in the big data market, Davenport said. But he added that he'd be surprised to see large enterprise companies running big data operations on Amazon Web Services' public cloud.
Another panelist, Rachel Schutt, a senior statistician at Google Research, was sanguine about private cloud deployments in support of big data projects. She pointed out that big data usually won't fit on one machine, requiring a scale-out approach to computing in which multiple models run across multiple machines, which can be scaled quickly. Google's big data research runs on an infrastructure it hosts itself, of course.
One thing the panelists agreed upon is the need to educate the next generation of data scientists on the ethics of what they do. A later presentation by MIT Media Lab professor Alex "Sandy" Pentland detailed his work with the European Union as well as the United States on a Data Bill of Rights. "It has to be you that controls data about you," Pentland said.
But there are tradeoffs to consider between privacy and the public good. Advanced big data techniques could be used to analyze information such as health records and user behavior to stop the spread of disease, for example. With end users' permission, more invasive big data analytics might be helpful, Pentland said.