Call Them Unreliable


“Just think about all of your personal information that’s stored on digital media,” says UTSC Computer Science Professor Bianca Schroeder.

“Your family photos and vacation videos, your emails, your bank statements. Unreliable storage systems can lead to the loss of this valuable information. The goal of my research is to find ways to safely store it despite the fact that individual components may fail.”

Schroeder joined the CMS faculty in 2008 after completing her PhD and two years of post-doctoral studies at Carnegie Mellon University in Pittsburgh. She has already won wide recognition for her work on system reliability, both in academic publications – where she has received numerous best-paper awards – and on news sites such as Computerworld, PCWorld and eWEEK.

“I was amazed to find how little we know about why systems fail,” Schroeder says, “even though reliability has always been a key concern in designing them. As systems keep growing in size, that concern becomes even greater. Individual clusters in data centres now routinely include thousands of nodes.

“Google, for example, is estimated to have several hundred thousand servers. That means even if failures of individual nodes are relatively rare – with a probability of, say, five percent a year – for many large-scale systems, there will be a node outage almost every day.”
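The back-of-envelope arithmetic behind that claim is easy to check. The cluster sizes below are illustrative examples, not figures from the article:

```python
def expected_failures_per_day(num_nodes: int, annual_failure_prob: float) -> float:
    """Approximate expected node failures per day, spreading the
    annual per-node failure probability evenly across the year."""
    return num_nodes * annual_failure_prob / 365

# A single cluster of 10,000 nodes at 5% per year:
print(round(expected_failures_per_day(10_000, 0.05), 2))   # ~1.37 failures per day

# A fleet of 200,000 servers (illustrative) at the same rate:
print(round(expected_failures_per_day(200_000, 0.05), 2))  # ~27.4 failures per day
```

At fleet scale, in other words, component failure stops being an exceptional event and becomes a daily operating condition.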

In getting at the root causes of system failures, Schroeder was not content to rely on vendors’ assurances or the findings of lab experiments. She wanted to conduct field research on large-scale systems, producing insights that would help make future technology more reliable.

“This is very sensitive information that companies don’t like to share,” she explains. “Nobody likes to talk about things that go wrong. But with a lot of persistence I’ve been able to convince a number of organizations to let me collect and analyze data on their systems.”

Among those that have opened their data centre doors are the Los Alamos National Lab and companies such as Google and Network Appliance. The findings so far have been eye-opening for Schroeder:

“The failure behaviour of systems in the real world looks very different from what we’ve assumed for decades. We’ve found that for both hard drives and memory DIMMs [dual inline memory modules] – the most frequently replaced components in today’s systems – failure rates in the field are orders of magnitude higher than the numbers previously quoted in research or by vendors.”
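To see why the gap matters: drive vendors typically quote reliability as a mean time to failure (MTTF) in hours, which converts to an annualized failure rate under an exponential-lifetime assumption. A minimal sketch of that conversion (the MTTF value is an illustrative datasheet-style figure, not one from the article):

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days

def afr_from_mttf(mttf_hours: float) -> float:
    """Annualized failure rate implied by a quoted MTTF, assuming
    exponentially distributed (memoryless) component lifetimes."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)

# A datasheet MTTF of 1,000,000 hours implies an AFR under 1%:
print(f"{afr_from_mttf(1_000_000):.2%}")  # 0.87%
```

Field replacement rates several times higher than such datasheet-implied figures are exactly the kind of discrepancy Schroeder's studies documented.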

The good news is that by carefully analyzing failure data, Schroeder and her research team, in collaboration with engineers at Google, have developed a method for protecting against system crashes due to errors in memory circuits.
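The article does not describe the mechanism, but one commonly discussed defence against recurring memory errors is page retirement: the operating system tracks correctable-error reports per physical page and stops allocating pages that error repeatedly, before an uncorrectable error crashes the machine. A minimal sketch under that assumption (the class, names, and threshold are all illustrative, not Schroeder's actual method):

```python
from collections import Counter

class PageRetirementPolicy:
    """Track correctable-error reports per physical page frame and
    flag pages for retirement once they exceed a threshold."""

    def __init__(self, threshold: int = 3):  # illustrative threshold
        self.threshold = threshold
        self.error_counts = Counter()
        self.retired = set()

    def report_error(self, page_frame: int) -> bool:
        """Record one correctable error; return True if the page
        should now be taken out of service."""
        if page_frame in self.retired:
            return True
        self.error_counts[page_frame] += 1
        if self.error_counts[page_frame] >= self.threshold:
            self.retired.add(page_frame)
            return True
        return False

policy = PageRetirementPolicy()
for _ in range(3):
    retire = policy.report_error(0x2A)
print(retire)                   # True: the page hit the error threshold
print(0x2A in policy.retired)   # True
```

The design intuition is that pages which have already produced correctable errors are far more likely to produce future (possibly uncorrectable) ones, so retiring a small amount of memory buys a large reduction in crash risk.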

Google is considering plans to deploy the solution across its systems in the coming year. “That’s the exciting thing about this area of research,” Schroeder concludes. “There’s an opportunity to really make an impact.”

© University of Toronto Scarborough