Science Wednesday: OnAir – Huge Datasets Pose Challenges but Hold Promise

2010 February 17

During a recent visit to Harvard, I sat down with Francesca Dominici, a biostatistician and former director of the Johns Hopkins Particulate Matter Research Center.

Dominici confessed that she has spent much of her time at Harvard thus far figuring out how to transfer, store and manage all of the data that has accumulated over years of research.

How hard could it be to move data, I wondered?

Her projects at Hopkins included a national study showing hospital admissions and mortality associated with exposure to air pollution particles.

“We’re using all data on particulate matter and particulate matter composition for every single monitoring station in the United States from the first date it has been available up until 2007.”

This includes years’ worth of ambient air data from every zip code in the country.

To get information on human health effects, Dominici uses Medicare data, including “every hospitalization for every person older than 65,” amounting to over 48 million subjects.

In all, the data (which continue to grow) add up to seven terabytes, Dominici said.

How much is a terabyte? It would take 1,000 one-gigabyte flash drives to hold a single terabyte. Now imagine 7,000 of those flash drives, and you can wrap your mind around how much data Dominici has on her hands.
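The flash-drive comparison can be checked with a few lines of arithmetic. The sketch below is only an illustration, assuming decimal units (1 terabyte = 1,000 gigabytes); the function name `drives_needed` is made up for this example and does not come from Dominici's work.

```python
import math

GB_PER_TB = 1_000  # assumption: decimal units, 1 TB = 1,000 GB


def drives_needed(terabytes: float, drive_size_gb: float = 1.0) -> int:
    """Return how many drives of drive_size_gb are needed to hold the data."""
    total_gb = terabytes * GB_PER_TB
    # Round up: a partly filled drive still counts as a drive.
    return math.ceil(total_gb / drive_size_gb)


print(drives_needed(1))  # 1000
print(drives_needed(7))  # 7000
```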

To cope with the mass of information, Dominici explained, it helps to pick and choose what data to work with at any given time. She compared the process to using a storage closet, where you can put away winter clothes during the summer months and take them out again when it gets cold.

“The good news… is that you don’t need to manage it dynamically, all at once,” she said.
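One common way to work with a dataset that is too large to handle all at once is to read it in pieces and keep only a running summary. The sketch below is a generic illustration of that idea, not Dominici's actual method; the file layout, the column names (`county`, `admissions`), and the function name are all hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def total_admissions_by_county(lines, chunk_size=1000):
    """Sum a hypothetical 'admissions' column per county, one chunk at a time."""
    reader = csv.DictReader(lines)
    totals = defaultdict(int)
    while True:
        # Pull at most chunk_size rows into memory; the rest stays "in the closet".
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        for row in chunk:
            totals[row["county"]] += int(row["admissions"])
    return dict(totals)


# Tiny made-up sample standing in for a much larger file.
sample = io.StringIO("county,admissions\nA,3\nB,5\nA,2\n")
print(total_admissions_by_county(sample, chunk_size=2))  # {'A': 5, 'B': 5}
```

Only `chunk_size` rows are held in memory at a time, so the same function would work on an open file handle for a file far larger than the sample.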

Despite the challenges of handling and analyzing such a vast amount of information, Dominici thinks the efforts will be fruitful.

“I have high confidence in the national study because I can see real improvements in getting sharper results as more data becomes available,” she said.

One study using the data, published in the Journal of the American Medical Association (JAMA), showed that the causes of death and hospitalization related to air pollution varied across the country. “Cardiovascular risks tended to be higher in counties located in the Eastern region of the United States,” the study reported.

As analysis continues, other questions about air pollution risks will be answered. For now, though, Dominici is neck-deep in data, and it seems she likes it that way.

“As a statistician, I really like to do this because I can have an impact,” she said.

“Going from seven terabytes of data to estimates that have an impact on policy… it’s very, very satisfying.”

About the Author: A student contractor with EPA’s Office of Research and Development, Becky Fried is a regular “Science Wednesday” contributor.

Editor's Note: The opinions expressed here are those of the author. They do not reflect EPA policy, endorsement, or action, and EPA does not verify the accuracy or science of the contents of the blog.

Please share this post. However, please don't change the title or the content. If you do make changes, don't attribute the edited title or content to EPA or the author.

7 Responses
  1. armansyahardanis
    February 17, 2010

    Thank you, God, for choosing the humans who created the computer for our planet. I think the Nobel Prize should be given to the Internet inventor team… including you, Ma’am, Francesca Dominici.

  2. Alexandr
    February 17, 2010

    Every problem can be solved if we have the will to resolve the difficulties that have arisen in carrying out this project. The project should be divided into two parts:
    1. Preserving industrial development and the rate of economic growth.
    2. Increasing spending (monetary injections) on maintaining and preserving the environment.
    The aim is to balance the consumption and the restoration of natural resources. Everything can be solved, given the will to act and to solve the problems, including the problems of the transition period.

  3. jim
    February 17, 2010

    I really enjoy your email information. I would like some pictures to identify projects that get chosen for grants. I would like to see the whole project, from the written paperwork to the end. It can teach us. We need the money for so many reasons. Not just climate and jobs, but to follow the money. Money to see the who, what, why, where and when. Our rescues need a video to show how we get out of the worst economic times.
    Chu, let’s get more Bio-Hydrogen jobs.
    Thank you

  4. Jackenson Durand
    February 17, 2010

    There is no question about the magic of technology.
    It is always better to use nature’s tools in a good way.
    We are not going to overlook the unfortunate damage data can do when it is used for other purposes.
    We are going to study how it helps overcome human limitations in solving problems, by accumulating and saving millions of pieces of information.

  5. Al Bannet
    February 18, 2010

    Becky Fried,

    So you feel powerful and gratified surrounded by terabytes of data? Well, at least you’re honest about it. But the rest of humanity outside of the bureaucracies feels overwhelmed and swamped by such oceans of complexity, which must get worse as the economy and the population keep on growing relentlessly. But the Earth is not growing. Instead, it is slowly shrinking with each quake and volcano. So, the question is: how much of our human industries can the Earth absorb? I assure you it is not unlimited, but any rational person could figure that out; it’s so obvious.

  6. Michael E. Bailey
    February 19, 2010

    It looks like a treasure chest of data that can be used to illuminate good, thoughtful public policy in the integrated areas of transportation, land use and sustainable economic development. The California Air Resources Board has a similar study that came out a month or two ago that also shows negative impacts on the cardiovascular systems of senior citizens who are in freeway traffic for long periods of time. The major cause of the cardiovascular problems seems to be from diesel particulates in the air over and near the freeways. This is another reason why we need to move away from diesel to CNG, LNG, and hydrogen fuels. Cleaner air means healthier, life saving air. Best wishes, Michael E. Bailey.

  7. Mowgli
    March 18, 2010

