Science Wednesday: OnAir – Huge Datasets Pose Challenges but Hold Promise

During a recent visit to Harvard, I sat down with Francesca Dominici, a biostatistician and former director of the Johns Hopkins Particulate Matter Research Center.

Dominici confessed that she has spent much of her time at Harvard thus far figuring out how to transfer, store and manage all of the data that has accumulated over years of research.

How hard could it be to move data? I wondered.

Her projects at Hopkins included a national study showing hospital admissions and mortality associated with exposure to air pollution particles.

“We’re using all data on particulate matter and particulate matter composition for every single monitoring station in the United States from the first date it has been available up until 2007.”

This includes years’ worth of ambient air data from every zip code in the country.

To get information on human health effects, Dominici uses Medicare data, including “every hospitalization for every person older than 65,” amounting to over 48 million subjects.

In all, the data (which continue to grow) add up to seven terabytes, Dominici said.

How much is a terabyte? It would take 1,000 one-gigabyte flash drives to hold a single terabyte. Now imagine 7,000 of those flash drives, and you can wrap your mind around how much data Dominici has on her hands.

As a way to cope with the mass of information, Dominici explained that it helps to pick and choose what data to work with at any given time. She compared the process to using a storage closet, where you can put away winter clothes during the summer months and take them out again when it gets cold.

“The good news… is that you don’t need to manage it dynamically, all at once,” she said.
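For readers curious what working with only part of the data at a time can look like in practice, here is a minimal sketch in Python. It is not Dominici's actual workflow; the column names, the tiny sample table and the function name are all made up for illustration. The idea is simply that records are read one at a time and only running totals are kept in memory.

```python
import csv
import io
from collections import defaultdict


def count_admissions_by_year(rows):
    """Add up admissions per year while reading one record at a time."""
    totals = defaultdict(int)
    for row in rows:
        # Only the running totals stay in memory, never the full table.
        totals[row["year"]] += int(row["admissions"])
    return dict(totals)


# Hypothetical stand-in for a very large file; a real one would be opened from disk.
sample = io.StringIO(
    "year,county,admissions\n"
    "2005,A,10\n"
    "2005,B,5\n"
    "2006,A,7\n"
)

result = count_admissions_by_year(csv.DictReader(sample))
print(result)
```

Because the reader hands over one row at a time, the same function would work on a file far larger than the computer's memory, much like taking one season's clothes out of the closet and leaving the rest in storage.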

Despite the challenges of handling and analyzing such a vast amount of information, Dominici thinks the efforts will be fruitful.

“I have high confidence in the national study because I can see real improvements in getting sharper results as more data becomes available,” she said.

One study using the data, published in the Journal of the American Medical Association (JAMA), showed that causes of death and hospitalization related to air pollution varied across different parts of the country. “Cardiovascular risks tended to be higher in counties located in the Eastern region of the United States,” the study reported.

As analysis continues, other questions about air pollution risks will be answered. For now, though, Dominici is neck deep in data, and it seems she likes it that way.

“As a statistician, I really like to do this because I can have an impact,” she said.

“Going from seven terabytes of data to estimates that have an impact on policy… it’s very, very satisfying.”

About the Author: A student contractor with EPA’s Office of Research and Development, Becky Fried is a regular “Science Wednesday” contributor.