Unleashing big data against disease
Just as the National Security Agency seeks to ferret out terrorists or other troublemakers by mining the vast quantity of information in various electronic media, disease detectives are trawling social media to identify and track down global health trends or threats.
But NSA surveillance has also heightened concerns over privacy and misuse of that information. So health officials and data experts are pre-emptively raising questions about the potential problems of using big data for epidemiology:
- Is it OK if, working backward from a kid’s Tweet, we figure out where he goes to school, what he does with his girlfriend, where he goes on vacation and where, exactly, he’s sitting in his house when he’s Tweeting?
- Is the snooping justified if it reveals he’s sexually active, and this triggers placement of information about the HPV vaccine in the ad space on his Facebook page?
- Or how about if he and his friends are Tweeting about feeling sick—fever and cough—and we use that data to help public health get a jump on a flu outbreak?
So went the discussions last week at the International Conference on Digital Disease Detection in San Francisco, a mind meld of innovators in informatics, genomics and public health on wrangling big data to benefit public health. That is, how to fight disease by leveraging the 1.1 billion Facebook and 500 million Twitter users, 1 billion monthly YouTube visitors, and the proliferation of personal health apps and electronic health records coming online daily.
In rapid-fire talks, a succession of researchers blazed through examples of exploiting data mining and crowdsourcing to boost health: scanning social media posts to accelerate detection of a bird flu outbreak in Singapore; mining electronic health records to detect patterns in drug or medical device failures in the US; surveying international events to flag potential threats along the food supply chain. (Tip: Don’t buy shrimp shipped from a region whose markets are closed following floods.)
A group working outside Chiang Mai, Thailand, told the story of Saraphi Health, which is arming community volunteers with tablets and smartphones to collect detailed data on the district’s 80,000 residents (47,000 down, 33,000 to go) and plot it on Google Maps. The data will enable the district’s scarce health workforce to partner with patients, their families and volunteers to better prevent and manage chronic disease.
The use of digital information to fight disease has spiked over the last decade as data have become cheaper and easier to gather and analyze. The surge began with applications like Google Flu Trends, which mines data on flu-related searches, and HealthMap, which aggregates online sources, from eyewitness accounts to newspaper stories to official reports, to provide a real-time picture (map) of disease outbreaks and other public health events. In recent years, it has expanded to include apps like HealthMap’s Flu Near You, a crowdsourcing tool that gathers anonymous reports of flu and in turn provides users with information and resources.
While these advances offer great promise (they’re credited with telescoping the time it can take to detect a potential pandemic from nearly half a year to just over three weeks), they are decidedly a work in progress. At the peak of last year’s flu season, as Nature and other media reported, Google Flu Trends estimated that more than 10 percent of the population had flu, whereas official reports put the number at closer to six percent. Some attributed the error to an aggressive media campaign on flu, which prompted a high number of searches.
Algorithms are being refined and assumptions reexamined to get at how to make the data more reliable. Or, as a guy on the exhibit floor from Topsy, a program that mines and analyzes social media, put it, how to monitor for fever without having your results distorted by Tweets referencing “Bieber Fever.” Or how to account for different terms and diagnostic practices, and better consider context to provide more reliable and comprehensive regional and global pictures.
Eliminating the “noise” and bias in the data to enhance its reliability clearly was top of mind among conference participants.
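To make the “Bieber Fever” problem concrete, here is a minimal sketch of one way such noise might be screened out before counting symptom mentions. It is an illustration only, not the method used by Topsy, Google Flu Trends or any other tool mentioned here; the phrase list, the symptom terms and the function names are all assumptions made for the example.

```python
import re

# Hypothetical list of phrases that contain "fever" but do not describe illness.
NOISE_PATTERN = re.compile(
    r"\b(?:bieber|cabin|saturday\s+night)\s+fever\b", re.IGNORECASE
)

# Hypothetical symptom terms to look for once the noise is stripped.
SYMPTOM_PATTERN = re.compile(r"\b(?:fever|cough)\b", re.IGNORECASE)


def is_symptom_mention(text):
    """Return True if the post mentions a symptom outside a known noise phrase."""
    cleaned = NOISE_PATTERN.sub(" ", text)  # blank out the noise phrases first
    return SYMPTOM_PATTERN.search(cleaned) is not None


def count_symptom_posts(posts):
    """Count the posts that still mention a symptom after noise removal."""
    return sum(1 for post in posts if is_symptom_mention(post))


sample_posts = [
    "Home sick with a fever and cough",
    "Total Bieber Fever at the concert tonight!",
    "Bieber fever aside, I think I have a real fever",
]
symptom_count = count_symptom_posts(sample_posts)
```

A fixed phrase list like this only catches the noise someone has already thought of, which is part of why refining the algorithms remains an open problem.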
Also trending at the confab:
- Making ethical use of personal data extracted from social media. (See the scenario above, posed by Caitlin Rivers of the Virginia Bioinformatics Institute at Virginia Tech.) General consensus: It’s OK to use that data in the aggregate, without names or personal details attached. Beyond that, there’s debate.
- Combining data sources: better integrating data from a range of sources, including those on animal and environmental hazards.
- Putting the data in the hands of those who can put it to the best use: figuring out who “owns” this data, and how to make it useful for public health, individuals and other health practitioners. This is a complex challenge, logistically, financially, and ethically, especially when most public health departments are in duck-and-cover mode with slashed funding.
As tantalizing as the possibilities are, there’s no talk of handing everything over to the search bots. Digital detection may have made it a lot easier for John Snow to map cholera cases to the contaminated Broad Street water pump, but we’d still need him to figure out what questions to ask, and what to do with the answers.