American Journal of Law & Medicine

The Use and Misuse of Biomedical Data: Is Bigger Really Better?

Very large biomedical research databases, containing electronic health records (EHR) and genomic data from millions of patients, have been heralded recently for their potential to accelerate scientific discovery and produce dramatic improvements in medical treatments. Research enabled by these databases may also lead to profound changes in law, regulation, social policy, and even litigation strategies. Yet, is "big data" necessarily better data?

This paper makes an original contribution to the legal literature by focusing on what can go wrong in the process of biomedical database research and what precautions are necessary to avoid critical mistakes. We address three main reasons for approaching such research with care and being cautious in relying on its outcomes for purposes of public policy or litigation. First, the data contained in biomedical databases is surprisingly likely to be incorrect or incomplete. Second, systematic biases, arising from both the nature of the data and the preconceptions of investigators, are serious threats to the validity of research results, especially in answering causal questions. Third, data mining of biomedical databases makes it easier for individuals with political, social, or economic agendas to generate ostensibly scientific but misleading research findings for the purpose of manipulating public opinion and swaying policymakers.

In short, this paper sheds much-needed light on the problems of credulous and uninformed acceptance of research results derived from biomedical databases. An understanding of the pitfalls of big data analysis is of critical importance to anyone who will rely on or dispute its outcomes, including lawyers, policymakers, and the public at large. The paper also recommends technical, methodological, and educational interventions to combat the dangers of database errors and abuses.




    A. Ongoing Initiatives to Create Biomedical Databases
    B. Using Biomedical Databases and Data Networks

       1. Scientific Discovery
       2. Quality Assessment and Improvement
       3. Post-Marketing Surveillance of Drugs and Devices
       4. Public Health Initiatives
       5. Litigation


    A. Data Entry Errors
    B. Incomplete or Fragmented Data
    C. Data Coding, Standardization, and Extraction
    D. Errors Due to Software Failures


    A. Selection Bias
    B. Confounding Bias
    C. Measurement Bias



    A. Technology Improvements
    B. Human Hands

       1. Data Quality Assessment
       2. Causal Inference Techniques

    C. Education and Prevention of Research Misuse



In 2009, the Journal of Psychiatric Research published an article that linked abortion to psychiatric disorders. (1) The researchers examined "national data sets with reproductive history and mental health variables" to formulate their findings. (2) The study was widely cited among abortion opponents, (3) and several states enacted legislation requiring that women seeking abortions receive counseling that includes warnings about potential long-term mental health problems. (4) In 2012, however, the study was discredited by scientists who scrutinized its design and found that it was severely flawed. (5) The original researchers neglected to compare women with unplanned pregnancies who did have abortions to those who did not and failed to focus only on mental health problems that manifested after terminated pregnancies. (6) Thus, what appeared to be solid scientific evidence turned out not to be so, but not before having significant impact on some state legislatures.

The accelerating transition from paper medical files to electronic health records (EHR) systems (7) is facilitating the creation of large health information databases. (8) In the future, these may include significant genetic information because many EHRs will contain or be linked to genetic data about patients. (9) In addition, scientists are constructing large databases from genome sequencing projects. (10) Biomedical databases can serve as invaluable resources for researchers. There is justified enthusiasm about the potential for research using them to yield improved treatments and beneficial policy changes, and we have elaborated on the promise of such research in prior work. (11) Computer processing of digitized records permits fast and relatively inexpensive data analysis and synthesis, which can enable scientific discoveries and ultimately affect public policy and law. (12) Notably, the size and scope of integrated biomedical databases may allow researchers to overcome certain problems they encounter with smaller-scale studies, such as unrepresentative study groups and insufficient statistical power or precision. (13)

EHR-based research is likely to become increasingly important because of several federally sponsored initiatives. These include comparative effectiveness research that is promoted by the Patient Protection and Affordable Care Act of 2010 (14) and post-marketing surveillance authorized by the Food and Drug Administration Amendments Act of 2007. (15)

Anyone considering the outcomes of record-based studies, however, must recognize the shortcomings of contemporary EHR and genomic data and the challenges of inferring causal effects correctly. (16) Much has been written about EHR privacy risks, but this paper makes a different contribution to the legal literature by focusing on what can go wrong in the process of biomedical data analysis and what precautions must be taken to avoid critical mistakes. It sheds much-needed light on the problems of naive or irresponsible use of biomedical databases, and these problems are likely to become much more common and pressing in the near future. The data-use pitfalls we discuss are familiar to competent biomedical researchers but must be understood by lawyers, bioethicists, policymakers, and anyone else who will rely on research results.

We use the term "biomedical databases" to mean databases of EHRs and/or genomic information as well as decentralized, federated database systems. (17) Thus, in this paper, we address non-interventional research, that is, research that is based on review of records, which we also call "records-based research" or "observational research." (18) We do not intend to comment on clinical studies in which investigators conduct experiments using human subjects (19) or on research involving the administration of questionnaires or surveys.

Observational studies are relevant to the law because their outcomes can lead to regulatory enforcement actions or to legislative changes, and they can be used as evidence in litigation. For example, observational studies may reveal that use of a medication or device causes patients to suffer serious adverse events, and this discovery may induce the Food and Drug Administration (FDA) to intervene. (20) Observational studies may also uncover statistical associations between illnesses and exposure to certain substances or between diseases and genetic variations. (21) Reports of these associations may be used in litigation by both plaintiffs and defendants. (22) Plaintiffs may file tort cases against product manufacturers, and toxic tort defendants may in turn use scientific evidence to attack plaintiffs' claims and argue that something other than their products caused the plaintiffs' illnesses. (23)

News outlets frequently report new research findings. Press reports often trumpet the discovery that factor A is statistically associated with or "linked" to condition B. The availability of large biomedical databases greatly facilitates the discovery of such associations. However, the nature of such data can complicate the determination of whether factor A actually causes or contributes to condition B. We address three main reasons for a cautious approach to incorporating record-based research into the law.

First, the data contained in biomedical databases may be of poor quality, incomplete, or even deliberately distorted. (24) For example, a recent New York Times article reported that the automated features of EHR systems make it easy for doctors to exaggerate the care they provided for purposes of Medicare reimbursement. (25) Doctors can simply click on menu items or copy and paste narrative in order to justify billing, and some lack scruples with respect to overstating or even fictionalizing what occurred during clinical encounters. (26) Such practices not only defraud Medicare, but also compromise the accuracy of EHRs. Moreover, they can systematically bias research results.

Second, valid causal analysis is much more difficult with observational data than with data from well-designed and well-executed randomized experiments or clinical trials. (27) Unfortunately, having large amounts of data ("big data") does not necessarily ameliorate this problem. The challenges of properly analyzing observational data and making appropriate causal inferences (28) are illustrated in a paper entitled "Does Obesity Shorten Life? The Importance of Well-Defined Interventions to Answer Causal Questions." (29) The researchers critique previous observational studies of obesity and mortality and conclude that they were flawed because they failed to specify what interventions were used to reduce body mass index (BMI). Different methods of changing BMI (e.g., surgery, diet, exercise) are associated with different risk levels for patients, and mortality may actually be associated with the treatment rather than the underlying obesity in some cases. (30) Thus, researchers cannot reach meaningful conclusions about the benefits of reducing BMI without knowing what interventions were used to achieve this goal in each instance. (31)

Third, individuals with political, social, or economic agendas may "mine" or "dredge" biomedical databases to find links (statistical associations) between actions, behaviors, or policies, on the one hand, and outcomes of public interest, on the other hand, for the purpose of manipulating public opinion and swaying policy decisions. (32) The risk of misinterpretation of such results by interested parties is high if they are not well-trained and scrupulous researchers. Research about the purported link between abortion and psychiatric disorders, discussed above, demonstrates this potential danger. (33) Pro-life advocates used questionable scientific data to promote a controversial legislative agenda.
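The arithmetic behind this risk can be sketched briefly. The following Python fragment is an illustration only and is not drawn from the studies discussed here. It assumes that each candidate association is tested independently at a conventional significance threshold of 0.05; the function name is hypothetical. Under those assumptions, it computes the probability that at least one purely spurious "link" appears when many associations are tested.

```python
def chance_of_spurious_link(num_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests.

    Assumes every tested association is truly absent and that the tests
    are independent, which is a simplification of real database research.
    """
    if num_tests < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("num_tests must be >= 0 and alpha must lie in [0, 1]")
    return 1.0 - (1.0 - alpha) ** num_tests


# Compare a single planned test with broad searches of a database.
for n in (1, 10, 100):
    print(n, round(chance_of_spurious_link(n), 3))
```

With one hundred independent tests, the probability of at least one chance finding exceeds ninety-nine percent under these assumptions, which is why a reported association carries little weight unless the number of associations examined is disclosed.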

The paper proceeds as follows. Part II provides background information. It describes ongoing efforts to build biomedical databases and analyzes the relevance of observational studies to law and public policy. Part III analyzes common shortcomings of biomedical data that should give analysts and the public pause. These include input errors, incomplete or fragmented records, and flaws in data coding or standardization.

Part IV provides an in-depth discussion of causal inference and of biases affecting observational studies. It analyzes the challenges of inferring causation in observational studies, including the problems of selection bias, confounding bias, and measurement bias. Indeed, confounding bias and selection bias will likely be fundamental concepts in legal reasoning in big data environments. Part V addresses the potential use of observational study outcomes for purposes of furthering political, social, and economic agendas.

Finally, Part VI analyzes the factors that contribute to sound research and provides guidance for policymakers and litigants seeking to determine whether particular research outcomes are reliable. The quality of digitized research databases and the studies that grow out of them will depend not only on good technology, but also on persistent human efforts to safeguard the integrity of research projects. Technological advances are needed to enhance interoperability, data capture, data-extraction capabilities, and system usability. In addition, clinicians and patients can partner to assess the validity of the data contained in EHRs, and investigators must be scrupulous about study design, analysis, and publication. This Part also describes and critiques the use of causal inference diagrams, which have received little attention in the legal literature but are increasingly common in other fields. (34)

Equally important is ensuring that the legal community, journal editors, and the public at large are not misled by those who appear to engage in scientific endeavors but who in truth misuse evidence to promote their own political, social, or economic agendas. Legal practitioners must understand the complex issues raised by big data in order to play a useful role in protecting the public's interests. To this end, we recommend the development of law school and other educational programs about the challenges of observational data analysis and causal inference.


Researchers and other analysts may gain access to large-scale collections of biomedical data in two primary ways. First, health information can be collected into large databases and de-identified to protect patient privacy. (35) Such databases could be limited to particular hospital systems, be expanded to cover entire regions, or even be national in scope. (36) In the alternative, researchers may use a "federated system" by which medical institutions manage and maintain control of their own databases, but they allow researchers to submit statistical queries through a standard web service in order to obtain summary statistics for a study population. (37) Trusted third-party aggregators can operate the query service. (38)
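The federated arrangement can be illustrated with a minimal sketch. The Python code below is hypothetical throughout: the class, function, and field names are assumptions made for illustration, and the code does not describe any actual query service. It shows only the basic idea that each institution answers a query with summary counts while its patient records stay in place.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """A hypothetical institution that keeps its patient records local."""
    name: str
    records: list  # each record is a dict and never leaves the site

    def count_query(self, predicate):
        """Answer a query with summary counts only, not patient records."""
        matching = sum(1 for record in self.records if predicate(record))
        return {"matching": matching, "total": len(self.records)}


def federated_count(sites, predicate):
    """Aggregator: combine per-site summaries without seeing any record."""
    matching = total = 0
    for site in sites:
        summary = site.count_query(predicate)
        matching += summary["matching"]
        total += summary["total"]
    return {"matching": matching, "total": total}


# Invented records for two invented sites.
sites = [
    Site("Hospital A", [{"age": 70, "on_drug": True},
                        {"age": 45, "on_drug": False}]),
    Site("Hospital B", [{"age": 66, "on_drug": True},
                        {"age": 81, "on_drug": True},
                        {"age": 30, "on_drug": False}]),
]
result = federated_count(sites, lambda r: r["on_drug"] and r["age"] >= 65)
print(result)
```

A working service would need further safeguards, such as authentication and limits on very small counts, which this sketch omits.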

Many large biomedical databases and federated systems already exist and are used for non-treatment purposes. (39) The term "secondary use" refers to the utilization of health information outside the clinical setting. (40) This Part describes a sample of data-collection initiatives. It also discusses how experts in the biomedical research, quality assessment, public health, and litigation arenas may utilize EHR data.


The Federal Government has clearly recognized the usefulness of biomedical databases and enthusiastically supports database projects. The Obama Administration has announced an overarching effort called the "Big Data Research and Development Initiative" ("Big Data"). (41) The initiative's purposes are to advance cutting-edge technologies needed to gather and process "huge quantities of data;" to employ those technologies to promote scientific discovery, improved national security, and education; and to expand the workforce skilled in these technologies. (42) Big Data will involve six federal agencies and departments and is estimated to cost $200 million. (43) As part of Big Data, the National Institutes of Health (NIH) will make data from its 1000 Genomes Project publicly available through cloud computing. (44)

At the same time, many federal entities are independently building health information databases. (45) For example, the Department of Veterans Affairs (VA) is registering volunteers for its Million Veteran Program to construct a large research framework that will link anonymized blood samples and health information. (46) The VA plans to study how genes affect health and disease. (47)

The Centers for Medicare & Medicaid Services created a research database called the Chronic Condition Data Warehouse. (48) The database provides researchers with information about Medicare and Medicaid beneficiaries, claims for services, and assessment data. (49)

In May 2008, the FDA launched the Sentinel System in order to facilitate post-marketing surveillance and early detection of medical products' safety problems. (50) The Sentinel initiative aims to enable the FDA to access health information from 100 million individuals. (51) Sentinel is a federated system that will allow the FDA to send queries concerning potential product-safety problems to data holders such as Medicare, the VA, and major medical centers. (52) Using special analysis programs, the data holders will assess their records and send summary responses to the FDA. (53)

A large number of private-sector initiatives are ongoing as well. Geisinger Health Systems operates MedMining, a company that extracts EHR data, de-identifies it, and offers it to researchers. (54) The data sets that MedMining delivers to its customers include "lab results, vital signs, medications, procedures, diagnoses, lifestyle data, and detailed costs" from inpatient and outpatient facilities. (55)

Explorys has formed a large healthcare database derived from financial, administrative, and medical records. (56) It has partnered with major healthcare organizations such as the Cleveland Clinic Foundation and Summa Health System to aggregate and standardize health information from ten million patients and over thirty billion clinical events. (57) Using a cloud-computing platform, it provides customers with big data to use for research and quality improvement purposes. (58)

The electronic Medical Records and Genomics Network (eMERGE) is a consortium of five institutions with DNA repositories linked to EHRs that supply relevant clinical data. (59) The National Human Genome Research Institute supports eMERGE, and the National Institute of General Medical Sciences provides it with additional funding. (60) Each eMERGE center will study "the relationship between genome-wide genetic variation and a common disease/trait," using genome-wide association analysis. (61) A primary purpose of eMERGE is to develop approaches to conducting large-scale genetic research using DNA biobanks that are connected to EHR systems. (62)

The Distributed Ambulatory Research in Therapeutics Network Institute (DARTNet) is a collaboration among nine research networks, including 85 healthcare organizations and over 3,000 clinicians across the United States. (63) The first DARTNet federated network, eNQUIRENet, was created in 2007 and funded by the Agency for Healthcare Research and Quality. (64) DARTNet members allow data from their EHRs to be captured, de-identified, coded, standardized, and stored in a Clinical Data Repository (CDR) within each entity that also connects to billing, lab, hospital, and prescription databases. (65) CDR data are then transferred to a second database that makes de-identified information available to researchers through a secure web portal. (66)

Other agencies and organizations are building electronic registries and databases that focus on specific disease categories in an effort to promote research and quality improvement endeavors. These include the Cancer Biomedical Informatics Grid, (67) the Interagency Registry for Mechanically Assisted Circulatory Support, (68) the Extracorporeal Life Support Organization, (69) and the United Network for Organ Sharing. (70)


Large-scale biomedical databases may be used for many purposes. This section addresses a variety of ways in which they are likely to be used by researchers, regulators, public health officials, commercial entities, and lawyers. As we have indicated, biomedical databases constitute an important tool for medical researchers. They are also used by healthcare providers who conduct quality assessment and improvement activities, and they assist the FDA in monitoring the safety of drugs and devices on an ongoing basis. In addition, biomedical databases can support public health initiatives and allow litigants in tort cases to develop evidence concerning causation and harm.

1. Scientific Discovery

Biomedical databases can enable researchers to conduct large-scale observational studies that will fill existing knowledge gaps. Even today, clinicians practice medicine with an unsettling degree of uncertainty. (71) According to some estimates, doctors know that the treatments they prescribe will be effective in only twenty to twenty-five percent of cases. (72) Database proponents believe that records-based research could contribute substantially to the resolution of these uncertainties. (73)

Biomedical databases could allow researchers to access a vast quantity of information about millions of patients who are treated in varied clinical settings, have diverse attributes, and live in different regions of the country. (74) Available information could include patients' medical histories over their entire lifetimes. The data reviewed in database studies, consequently, may be far more abundant and comprehensive than the data generated by clinical trials, (75) which are rigorously controlled and often involve fewer than 3000 patients. (76) Large-scale studies have the potential to better reflect the entire population and expose how treatments are actually used in a large variety of medical facilities. (77) They also tend to enhance the precision of statistical analyses. (78)

If the researchers aim to show whether a specific treatment achieves the desired benefits, they may reasonably choose to conduct a randomized clinical trial to ensure that uncontrolled variables that influence outcomes, such as age or drug interactions, do not confound the study. (79) However, observational studies may be needed to determine whether the results of randomized clinical trials that involved only a few thousand patients can be generalized to the patient population at large and to realistic treatment situations rather than carefully controlled ones. (80) Furthermore, observational research based on medical records will often be sufficient to determine a treatment's adverse effects. (81) It is also useful for generating and testing speculative hypotheses that could lead to important insights. (82) Observational studies are often less costly and time-consuming than experimental research, especially when researchers obtain the required data from existing databases. (83)
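A small numerical sketch may help show how an uncontrolled variable such as age can confound an observational comparison. The counts below are invented solely for illustration and do not come from any study. In this hypothetical, treated patients fare better than untreated patients within each age group, yet the pooled figures suggest the opposite because most treated patients are older.

```python
# Hypothetical counts: (recovered, total) by age group and treatment status.
counts = {
    "younger": {"treated": (90, 100), "untreated": (800, 1000)},
    "older": {"treated": (500, 1000), "untreated": (40, 100)},
}


def rate(recovered, total):
    """Recovery rate within one group."""
    return recovered / total


def pooled_rate(arm):
    """Recovery rate for one arm when age is ignored."""
    recovered = sum(counts[group][arm][0] for group in counts)
    total = sum(counts[group][arm][1] for group in counts)
    return recovered / total


# Within each age group, compare treated and untreated patients.
for group in counts:
    treated = rate(*counts[group]["treated"])
    untreated = rate(*counts[group]["untreated"])
    print(f"{group}: treated {treated:.0%}, untreated {untreated:.0%}")

# Ignoring age reverses the apparent direction of the effect.
print(f"pooled: treated {pooled_rate('treated'):.0%}, "
      f"untreated {pooled_rate('untreated'):.0%}")
```

Randomization guards against this reversal by making treatment assignment independent of age; an observational study must instead measure and adjust for such variables.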

The benefits of observational studies are illustrated by the highly publicized controversy concerning an alleged association between vaccination and autism. In 1998, Dr. Andrew J. Wakefield and colleagues published a study in the Lancet that suggested a link between autism and the measles, mumps, rubella (MMR) vaccination. (84) The findings were based on testing of twelve children with developmental disorders. (85) In 2004 most of the authors "retracted the interpretation placed upon these findings in the paper" (86) after large-scale observational research involving the review of hundreds of records of autistic children in the United Kingdom found no causal association between the MMR vaccine and autism. (87) Consequently, the Centers for Disease Control and Prevention (CDC) now reassures the public on its website that there is no link between autism and vaccines. (88)

For purposes of genetic research, EHRs can be coupled with genetic samples and data so that analysts can obtain detailed and comprehensive characterizations of study subjects. (89) An increasingly common form of big-data observational research is genome-wide association studies (GWASs). (90) GWASs compare the DNA of individuals with a particular disease or condition to the DNA of unaffected individuals in order to find the genes involved in the disease. (91) A government website catalogues published GWASs and on October 24, 2013, listed 1,727 studies that had been conducted since 2005. (92) Critics have noted that although GWASs have led to the discovery of many genetic variants that are statistically associated with disease, most of the variants identified thus far appear to have a minimal effect on disease and explain only a small percentage of heritability. (93) Others assert that many GWASs to date have been compromised by serious design flaws. (94) However, GWASs remain an important scientific endeavor and will likely lead to significant discoveries in the future.
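The comparison at the core of a GWAS can be sketched for a single genetic variant. The Python fragment below is a simplified illustration with invented counts and a hypothetical function name; it is not the procedure used by any particular study. It computes an odds ratio, a common measure of how much more frequently a variant appears among affected individuals than among unaffected ones.

```python
def allele_odds_ratio(case_with, case_without, control_with, control_without):
    """Odds of carrying the variant among cases divided by odds among controls."""
    cells = (case_with, case_without, control_with, control_without)
    if min(cells) <= 0:
        raise ValueError("all four counts must be positive")
    return (case_with * control_without) / (case_without * control_with)


# Invented counts: 300 of 1,000 affected individuals carry the variant,
# compared with 200 of 1,000 unaffected individuals.
print(round(allele_odds_ratio(300, 700, 200, 800), 2))
```

An odds ratio near one indicates no association. Because an actual GWAS repeats a comparison of this kind for a very large number of variants, chance findings are expected unless the analysis accounts for the number of comparisons made.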

A different method of scanning the genome is genome-wide linkage studies (GWLSs). Researchers perform GWLSs when they are focusing on biologically related individuals and a phenotype, such as breast cancer, that some but not all of the family members have. …
