Science Wednesday: OnAir – Huge Datasets Pose Challenges but Hold Promise

During a recent visit to Harvard, I sat down with Francesca Dominici, a biostatistician and former director of the Johns Hopkins Particulate Matter Research Center.

Dominici confessed that she has spent much of her time at Harvard thus far figuring out how to transfer, store and manage all of the data that have accumulated over years of research.

How hard could it be to move data? I wondered.

Her projects at Hopkins included a national study showing hospital admissions and mortality associated with exposure to air pollution particles.

“We’re using all data on particulate matter and particulate matter composition for every single monitoring station in the United States from the first date it has been available up until 2007.”

This includes years’ worth of ambient air data from every zip code in the country.

To get information on human health effects, Dominici uses Medicare data, including “every hospitalization for every person older than 65,” amounting to over 48 million subjects.

In all, the data (which continue to grow) add up to seven terabytes, Dominici said.

How much is a terabyte? It would take 1,000 one-gigabyte flash drives to hold a single terabyte. Now imagine 7,000 of those flash drives, and you can wrap your mind around how much data Dominici has on her hands.

To cope with the mass of information, Dominici explained, it helps to pick and choose which data to work with at any given time. She compared the process to using a storage closet, where you can put away winter clothes during the summer months and take them out again when it gets cold.

“The good news… is that you don’t need to manage it dynamically, all at once,” she said.
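For readers curious what the storage-closet approach might look like in practice, here is a minimal Python sketch. It is not Dominici's actual method or code; the column names (`year`, `county`, `admissions`), the function name, and the sample numbers are all made up for illustration. The idea is simply to read records one at a time and keep only the slice you need, so the full dataset never has to sit in memory at once.

```python
import csv
import io
from collections import defaultdict


def admissions_by_year(rows, wanted_years):
    """Total admissions per year, reading one row at a time.

    Only rows from `wanted_years` are counted, so memory use depends on
    the number of years requested, not on the size of the input.
    (Hypothetical helper; column names are assumptions.)
    """
    counts = defaultdict(int)
    for row in rows:
        year = int(row["year"])
        if year in wanted_years:
            counts[year] += int(row["admissions"])
    return dict(counts)


# Tiny in-memory stand-in for a file far too large to load at once.
# The values are invented for this example, not real study data.
sample = io.StringIO(
    "year,county,admissions\n"
    "2005,A,10\n"
    "2006,A,12\n"
    "2006,B,7\n"
    "2007,B,9\n"
)

# csv.DictReader yields rows lazily, so the same call would work on an
# open file handle for a much larger file.
result = admissions_by_year(csv.DictReader(sample), wanted_years={2006, 2007})
print(result)
```

In this sketch the 2005 rows stay "in the closet": they are read past but never stored, and a later run could ask for them instead.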

Despite the challenges of handling and analyzing such a vast amount of information, Dominici thinks the efforts will be fruitful.

“I have high confidence in the national study because I can see real improvements in getting sharper results as more data becomes available,” she said.

One study using the data, published in the Journal of the American Medical Association (JAMA), showed that the causes of death and hospitalization related to air pollution varied across regions of the country. “Cardiovascular risks tended to be higher in counties located in the Eastern region of the United States,” the study reported.

As analysis continues, other questions about air pollution risks will be answered. For now, though, Dominici is neck-deep in data, and it seems she likes it that way.

“As a statistician, I really like to do this because I can have an impact,” she said.

“Going from seven terabytes of data to estimates that have an impact on policy… it’s very, very satisfying.”

About the Author: A student contractor with EPA’s Office of Research and Development, Becky Fried is a regular “Science Wednesday” contributor.

Editor's Note: The views expressed here are intended to explain EPA policy. They do not change anyone's rights or obligations. You may share this post. However, please do not change the title or the content, or remove EPA’s identity as the author. If you do make substantive changes, please do not attribute the edited title or content to EPA or the author.
