American Journal of Law & Medicine

Barbarians at the Gate: Consumer-Driven Health Data Commons and the Transformation of Citizen Science

"The expression 'barbarians at the gate' was... used by the Romans to describe foreign attacks against their empire." (1) "[It] is often used in contemporary English within a sarcastic, or ironic context, when speaking about a perceived threat from a rival group of people, often deemed to be less capable or somehow 'primitive.'" (2) 


Citizen science has a mixed reputation. It includes well-organized, crowd-sourced efforts like the British "great tide experiment" of the 1830s, which enlisted members of the public to monitor tides at 650 coastal locations, but it also includes less rigorous dabbling by amateurs. (3) Successful citizen science projects often engage laypeople, supervised by professional scientists, to collect and analyze data or even to assist in creating the finished scholarly works (4)--in other words, the citizens are a source of labor. This article explores an alternative model--citizen research sponsorship--in which citizens supply essential capital assets to support research. The assets could be monetary (research funding, for example), although this article focuses on a different kind of capital: data resources, which are a critical input for twenty-first century biomedical research. (5)

The citizen research sponsorship model flips the traditional control relationship of citizen science. Instead of laypeople laboring under the supervision of professional scientists, the professional scientists work at the instigation of citizen groups, using the people's data for projects the people endorse. Citizen groups that control an essential research input, such as data or biospecimens, sometimes succeed in leveraging their asset to enlist qualified scientists to generate desired knowledge. This sponsorship model was exemplified in the late 1980s when a group of Canavan-disease-affected families developed a disease registry and biospecimen bank and leveraged these resources to spur discovery of associated genetic variants and development of a diagnostic test. (6) Their sponsorship took the form of supplying data and biospecimens for the research, as opposed to providing funding. This revealed a new dynamic in the era of informational research (7) that mines preexisting health records and data derived from biospecimens: money will follow a good data resource, instead of data resources following (and having to be generated by) those who hold money. Data resources are a central currency of twenty-first-century science, and the question is, "Who will control them?"

The Canavan families' scientific success was later marred by litigation when their chosen investigator elected to patent his discoveries and charge royalties on the test. (8) They had naively assumed he would put his discoveries into the public domain. (9) Citizen sponsors, like any other research sponsors, need well-drafted research agreements if they want to avoid unpleasant surprises. The Canavan families' greatest contribution to science ultimately may have been that they demonstrated the power of well-organized citizen groups--perhaps, next time, with appropriate consulting and legal support--to instigate high-quality scientific research. Hiring lawyers and scientists is relatively straightforward if a citizen group has money, and money need not always come from external fundraising and donors. A citizen group that controls a critical data resource, coupled with a workable revenue model, may be able to monetize its resource lawfully and on terms ethically acceptable to the group members.

This article introduces consumer-driven data commons, which are institutional arrangements for organizing and enabling citizen research sponsorship. The term "consumer-driven" reflects a conscious decision to avoid terms like "patient-driven" or "patient-centered" that are ubiquitous yet evoke different meanings in the minds of different readers. (10) For some, "patient-centered" may refer to research that respects patient preferences, decisions, and outcomes, and "helps people and their caregivers communicate and make informed healthcare decisions, allowing their voices to be heard in assessing the value of healthcare options." (11) Others may view a system as patient-centered if it extends various courtesies to participating individuals, such as sharing progress reports about discoveries made with their data or enabling return of their individual research results. (12) Others may associate patient-centered databases with granular consent mechanisms that allow each included individual to opt in or opt out of specific data uses at will, like the patient-controlled electronic health records proposed in the last decade. (13) The term "consumer-driven" aims to quash preconceptions of this sort.

The choice of "consumer" rather than "patient" accords with a broad conception that health data include information about people in sickness (when their consumption of healthcare services generates clinical data) and in health (when they may purchase fitness trackers that generate useful information about lifestyle and exposures). This article construes personal health data (PHD) broadly to include data about patients, research participants, and people who use sensor devices or direct-to-consumer testing services (together, consumers). PHD includes traditional sources of health data, such as data from consumers' encounters with the healthcare system as well as data generated when they consent to participate in clinical research. PHD also may include individually identifiable (or re-identifiable) research results that investigators derive during informational research--research that uses people's data or biospecimens with or without their consent. (14) Increasingly, PHD includes genetic and other diagnostic information that healthy-but-curious consumers purchase directly from commercial test providers, as well as information people generate for themselves using mobile, wearable, and at-home sensing devices. More creepily, PHD also includes data captured passively by the panopticon of algorithms that silently harvest data from online shopping, professional, leisure, and social communication activities. (15) Such algorithms may support excruciatingly personal inferences about an individual's health status--for example, pregnancy--that arouse intense privacy concerns. (16) All these data are potential fuel for future biomedical discoveries.

The consumer-driven data commons discussed here would be self-governing communities of individuals, empowered by access to their own data, who come together in a shared effort to create high-valued collective data resources. These data commons are conceptually similar to the "data cooperatives, that enable meaningful and continuous roles of the individuals whose data are at stake" that Effy Vayena and Urs Gasser suggest for genomic research, (17) and to "people-powered" science that aims to construct communities to widen participation in science, (18) and to the "patient-mediated data sharing" described in a recent report on FDA's proposed medical device safety surveillance system. (19)

Consumer-driven data commons differ starkly from the traditional access mechanisms that have successfully supplied data for biomedical research in recent decades. This article explores how these mechanisms, embedded in major federal research and privacy regulations, enshrine institutional data holders--entities such as hospitals, research institutions, and insurers that store people's health data--as the prime movers in assembling large-scale data resources for research and public health. They rely on approaches--such as de-identification of data and waivers of informed consent--that are increasingly unworkable going forward. They shower individuals with unwanted, paternalistic protections--such as barriers to access to their own research results--while denying them a voice in what will be done with their data. Consumer-driven data commons also differ from many of the patient-centered data aggregation models put forward as alternatives to letting data holders control the fate of people's data. One alternative, already noted, is a personally controlled electronic health record with granular individual consent: that is, a scheme in which individuals (or their designated agents) assemble their own health data and then specify, in very granular detail, the particular data uses that would be acceptable to each individual.

In contrast, the consumer-driven data commons proposed here would aggregate data for a group of participating volunteers who, thereafter, would employ processes of collective self-governance to make decisions about how the resulting data resources--in the aggregate, as a collective data set--can be used. The group's collective decisions, once made, would be binding on all members of the group (at least until a member exited the group), but the decisions would be made by the group members themselves, according to rules and processes they established. This article explores the promise and the challenge of enabling consumer-driven data commons as a mechanism for consenting individuals to assemble large-scale data resources. Twenty-first century science, as discussed below, (20) needs large-scale, deeply descriptive, and inclusive data resources. Granular, individual consent can make it difficult to assemble such resources, which require collective action.

There are many competing visions of the public good and how to advance it. This analysis presumes, as its starting point, that the public good is served when health data are accessible for biomedical research, public health studies, (21) regulatory science, (22) and other activities that generate knowledge to support continuous improvements in wellness and patient care. The goal here is not to debate this vision but rather to assume it and study how competing legal and institutional arrangements for data sharing may promote or hinder the public good and address people's concerns about privacy and control over their PHD.


Consumer-driven data commons have the potential to elevate citizen science from its perceived status as do-it-yourself puttering and transform it into a force for addressing some of the grand scientific challenges of the twenty-first century. These challenges include several programs initiated during the Obama Administration, such as the Precision Medicine (23) and Brain Research through Advancing Innovative Neurotechnologies (BRAIN) (24) Initiatives and the Cancer Moonshot. (25) They also include efforts to clarify the clinical significance of genomic variants and to ensure that modern diagnostics are safe and effective. (26) Another major challenge is to develop a "learning health care system" (27) that routinely captures data from treatment settings, as well as from people's experiences as non-patients before and after their healthcare encounters, to glean insights to support continual improvements in wellness and patient care.

These scientific challenges all share a common feature: they require access to very large-scale data resources--sometimes, data for tens to hundreds of millions of individuals (28) (known as "data partners" (29) in the nomenclature of the Precision Medicine Initiative). The most valuable data resources are deeply descriptive in the sense of reflecting, for each individual, a rich array of genomic and other diagnostic test results, clinical data, and other available PHD such as data from mobile and wearable health devices that may reflect lifestyle and environmental factors influencing health. (30) The data need to be longitudinal in the sense of tracing, as completely as possible, the history of a person's innate characteristics, factors that may have influenced the person's health status, diagnoses during spells of illness, treatments, and health outcomes. (31)

Such data, unfortunately, are inherently identifiable. Access to at least some identifiers is necessary, at least in certain phases of database creation, in order to link each person's data arriving from different data holders, to verify that the data all pertain to the same individual, and to update the person's existing data with subsequent clinical observations. (32) Once the data have been linked together to create a longitudinal record for each individual, the identifiers could be removed if there is no need to add subsequent data about the individual. Even if overt identifiers like names are stripped off after the data are linked together, the resulting assemblage of data--deeply descriptive of each individual--potentially can be re-identified. (33) If a dataset contains a rich, multi-parametric description of a person, there may be only one individual in the world for whom all of the parameters are a match. If other, external datasets link a subset of those parameters to the person's identity, re-identification may be possible. (34)
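The re-identification mechanism described above--matching shared parameters ("quasi-identifiers") in a nominally de-identified record against an external, named dataset--can be sketched in a few lines of code. All names, fields, and values below are invented for illustration; this is not an actual attack on any real dataset.

```python
# A de-identified research record: overt identifiers stripped,
# but several descriptive parameters (quasi-identifiers) remain.
research_record = {"zip3": "021", "birth_year": 1978, "sex": "F",
                   "diagnosis": "variant carrier"}

# A hypothetical external dataset (e.g., a public roster) that links
# some of the same parameters to named individuals.
external = [
    {"name": "A. Smith", "zip3": "021", "birth_year": 1978, "sex": "F"},
    {"name": "B. Jones", "zip3": "021", "birth_year": 1982, "sex": "M"},
    {"name": "C. Lee",   "zip3": "946", "birth_year": 1978, "sex": "F"},
]

# Link the two datasets on the shared quasi-identifiers.
quasi_ids = ("zip3", "birth_year", "sex")
matches = [row for row in external
           if all(row[k] == research_record[k] for k in quasi_ids)]

# If exactly one person matches every shared parameter, the record is
# effectively re-identified even though no name was ever attached to it.
if len(matches) == 1:
    print("Unique match:", matches[0]["name"])
```

The richer and more multi-parametric the research record, the more likely it is that only one person in the external dataset matches, which is why deeply descriptive longitudinal data resist de-identification.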

For some important types of research, the data resources also need to be highly inclusive, in the sense that most (or even all) people are included in the dataset. (35) Inclusive data sets capture rare events, allowing them to be studied, and they avoid consent bias (a form of selection bias). (36) Empirical studies suggest that people who consent to having their data used in research may have different medical characteristics than the population at large. (37) For example, patients who are sick and have symptoms may feel more motivated than asymptomatic people are to volunteer for studies that explore possible genetic causes of their symptoms. If true, then a cohort of consenting research subjects may over-represent people who carry a specific gene variant and also happen to be ill. The study may reach biased conclusions misstating how often the variant results in actual illness.

Consent bias reportedly was a factor that contributed to a tendency for early studies to overstate the lifetime risk of breast and ovarian cancer in people with certain BRCA genetic mutations. (38) Costs of testing were high under the gene patenting doctrine of the day; insurance reimbursement criteria tended to make clinical BRCA testing available only to people with a personal or family history of these cancers; such people also were highly motivated to share their data for use in research. (39) As a broader population gains access to BRCA testing, the available data resources are gradually expanding to include more people who have mutations without developing cancer, and lifetime risk estimates are trending downward. (40) Getting these numbers right has obvious impact on future patients who face decisions based on their test results.
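The arithmetic of consent bias can be made concrete with a small worked example. The figures below are invented for illustration and are not actual BRCA statistics; they simply show how sampling that favors symptomatic carriers inflates an estimated lifetime risk.

```python
# Hypothetical population: 1,000 carriers of a variant whose true
# lifetime disease risk (penetrance) is 40%.
carriers = 1000
true_penetrance = 0.40
ill = int(carriers * true_penetrance)   # 400 carriers who develop disease
well = carriers - ill                   # 600 carriers who never do

# Assumed consent bias: symptomatic carriers contribute data at a 90% rate,
# while asymptomatic carriers (who may never even be tested) do so at 10%.
ill_in_study = int(ill * 0.90)          # 360
well_in_study = int(well * 0.10)        # 60

# The study's observed risk among consenting carriers.
observed = ill_in_study / (ill_in_study + well_in_study)
print(f"True risk: {true_penetrance:.0%}; study estimate: {observed:.0%}")
# The biased sample yields 360/420, roughly 86%, badly overstating the 40% true risk.
```

As the sampling rates equalize--for example, as broader access to testing brings more healthy carriers into the data--the observed estimate converges toward the true risk, which matches the downward trend in published risk estimates noted above.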

One possible way to create large, deeply descriptive, inclusive datasets free of consent bias would be to force all citizens to contribute their data, in effect requiring them to pay a "data tax" (an exaction of part of their data) just as we all must pay income taxes. That idea, seemingly, would be repugnant to many, and I do not propose it except to contrast it with a rarely considered policy that this article seeks to advance: Why not get people to want to participate in large-scale, deeply descriptive, inclusive datasets for use in research? Why not make participation interesting and enjoyable, perhaps even fun? Current ethical and regulatory frameworks that govern data access, such as the Common Rule (41) and Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule, (42) decisively fail to do this. Have we, as a society, unwittingly embraced a prune-faced framework of bioethics, such that making it fun for people to participate in informational research seems coercive or ethically problematic? If so, how did it come to this and what can we do about it?

The Common Rule and HIPAA Privacy Rule both provide workable pathways to obtain data--if necessary, without consent--for socially important research, public health purposes, and regulatory science. (43) But the resulting data uses are not fun--indeed, people often do not know the uses even occurred--and unconsented data access, even when it is legal, will always be controversial. (44) Surveys show that "the majority of consumers are positive about health research, and if asked in general terms, support their medical information being made available for research" (45)--in other words, they see research participation as potentially fun--but they want to be asked before their data are taken and they prefer that their data be de-identified. (46) Sadly, as just noted, de-identification may no longer be feasible, and even if it were feasible, it cannot support creation of the deeply descriptive, longitudinal data that twenty-first-century science needs. (47)

The existing regulations, which were designed for clinical research and for small-data informational studies of the past, function well enough and may continue to function, at least for those who are sufficiently well-lawyered to thread the needle of data access. But they do not excite people about becoming partners in the grand scientific challenges of the twenty-first century, which ought to be easy given how fascinating these challenges are. Current regulations sometimes insult the very people whose data investigators want to use, showering individuals with unwanted, paternalistic protections--such as barriers to the return of research results (48)--while denying them a voice in what will be done with their data. Data partners' only real "voice" is their right to withhold consent and, in effect, take their data and go home. Even that right can be waived by an Institutional Review Board, (49) typically staffed by employees of institutions that wish to use the people's data and whom the people never chose to represent their interests. (50)

Most people have no wish to take their data and go home. Surveys suggest that eighty percent of Americans would like to see their data used to generate socially beneficial knowledge. (51) They want to participate, but subject to privacy, data security protections, and other terms that are transparent and satisfactory to themselves. (52) Consumer-driven data commons are a vehicle for enabling consumers to set and enforce those terms through collective self-governance and to find the voice that ethics and regulatory frameworks consistently deny them.


There are multiple, viable pathways for developing health data commons to promote public good, and it will be important for policymakers to have the wisdom to allow them to evolve in parallel during early phases of the effort.

The first major pathway, (53) resembling propertization, bestows entitlements (such as specific rights of access, rights to transfer and enter transactions involving data, rights to make managerial decisions about data, or even outright data ownership) on specific parties. It then relies on those parties to enter private transactions to assemble large-scale data resources. The initial endowment of rights can be bestowed various ways: on the individuals to whom the data relate (patients and consumers); on data holders such as hospitals, insurers, research institutions, and manufacturers of medical and wearable devices that store and possess people's data; on both groups; or on other decision-makers.

A second major pathway is to develop data resources in the public domain (54)--for example, through legislation or regulations that force entities that hold data to supply it for specific public health or regulatory uses, or by using public funds (e.g., grants or tax incentives) to create data resources under rules that make them openly available for all to use (or for use by a designated group of qualified entities, such as public health officials or biomedical researchers, who are legally authorized to use data on the public's behalf).

A third pathway is to foster creation of data commons, which are distinct from the other two pathways and can include many different types of commons that may exist simultaneously. (55)

This section briefly clarifies the relationship among data ownership, data commons, and the public domain.

In 2014, the Health Data Exploration Project surveyed a sample consisting primarily of people who track their PHD and found that 54% believe they own their data; 30% believe they share ownership with the sensor company or service provider that enables collection of their data; 4% believe the service provider owns the data; and only 13% profess indifference. …
