This JAMA Guide to Statistics and Methods offers tips for handling big data, including important considerations regarding study population, methodology and sample size, data elements and presentation, and analysis and statistics.
With the advent of administrative databases and patient registries, big data is increasingly accessible to researchers. The large sample size of these data sets makes the study of rare outcomes easier and provides the potential to determine national estimates and regional variations. However, no database is completely free of bias and measurement error. With bigger data, random noise may reach statistical significance, and accuracy may be incorrectly inferred from narrow confidence intervals. While many principles apply to all studies, the importance of these methodological issues is amplified in large, complex data sets.
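The point about narrow confidence intervals can be illustrated with a minimal sketch. It assumes a hypothetical true mean systolic blood pressure of 120 mm Hg, a standard deviation of 15 mm Hg, and a systematic measurement error of 2 mm Hg; none of these values come from a real data set. For simplicity, the observed mean is set to its expected value (true mean plus bias) rather than simulated.

```python
import math

def ci_for_mean(observed_mean, sd, n, z=1.96):
    """Approximate 95% confidence interval for a mean (normal approximation)."""
    half_width = z * sd / math.sqrt(n)
    return observed_mean - half_width, observed_mean + half_width

# Hypothetical values chosen for illustration only
TRUE_MEAN = 120.0   # true population mean, mm Hg
BIAS = 2.0          # systematic measurement error, mm Hg
SD = 15.0           # standard deviation, mm Hg

for n in (100, 10_000, 1_000_000):
    low, high = ci_for_mean(TRUE_MEAN + BIAS, SD, n)
    covers_truth = low <= TRUE_MEAN <= high
    print(f"n = {n:>9,}: 95% CI {low:.2f} to {high:.2f}, "
          f"contains true mean: {covers_truth}")
```

As the sample grows, the interval narrows around the biased value, not the true one. A larger sample improves precision but does nothing to remove systematic error.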
STUDY POPULATION CONSIDERATIONS
It is important for the reader to understand how the investigator arrived at the study population. Usually, it is drawn from a larger source population to which inclusion and exclusion criteria have been applied. A flowchart of included and excluded participants, showing the number excluded and the reasons for exclusion, should be provided. Similarly, if the study is longitudinal, loss to follow-up should be reported. This information helps readers assess any selection bias.
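The counts for such a flowchart can be tallied as each criterion is applied. The following is a minimal sketch, not a prescribed method: the function name `apply_exclusions`, the record fields, and the two criteria are hypothetical and chosen only for illustration.

```python
def apply_exclusions(records, criteria):
    """Apply exclusion criteria in order.

    Returns the included records and one flow row per criterion:
    (reason, number excluded, number remaining).
    """
    remaining = list(records)
    flow = []
    for reason, should_exclude in criteria:
        kept = [r for r in remaining if not should_exclude(r)]
        flow.append((reason, len(remaining) - len(kept), len(kept)))
        remaining = kept
    return remaining, flow

# Hypothetical source population (illustrative records only)
source = [
    {"id": 1, "age": 45, "outcome": 1},
    {"id": 2, "age": 16, "outcome": 0},
    {"id": 3, "age": 70, "outcome": None},
    {"id": 4, "age": 33, "outcome": 0},
    {"id": 5, "age": 17, "outcome": None},
]

# Hypothetical exclusion criteria, applied sequentially
criteria = [
    ("Age younger than 18 years", lambda r: r["age"] < 18),
    ("Missing outcome", lambda r: r["outcome"] is None),
]

cohort, flow = apply_exclusions(source, criteria)
print(f"Source population: n = {len(source)}")
for reason, n_excluded, n_left in flow:
    print(f"  Excluded ({reason}): n = {n_excluded}; remaining: n = {n_left}")
print(f"Study population: n = {len(cohort)}")
```

Because the criteria are applied in sequence, a record that meets several criteria is counted only under the first one it meets. The order should therefore be stated alongside the flowchart.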
METHODOLOGICAL AND SAMPLE SIZE CONSIDERATIONS
The objective and outcome(s) of the study should have been defined before data collection and analysis. If an author is looking for a difference in some variable between 2 cohorts, this difference and its confidence interval should also be prespecified. The effect estimate should be reported as a patient-centered, clinically meaningful, and interpretable difference1 in addition to the statistical result (eg, regression coefficient, P value). Unfortunately, mining large data sets without preplanning can lead to unintended, often mistaken conclusions. Statistical significance depends on sample size, and with a large enough sample, very small differences between groups that are not clinically meaningful may reach statistical significance.
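The dependence of statistical significance on sample size can be shown with a short calculation. This sketch assumes a hypothetical 0.5 mm Hg difference in mean systolic blood pressure between 2 equal-sized groups with a common, known standard deviation of 15 mm Hg, and uses a simple 2-sample z test; the function name and all values are illustrative.

```python
import math
from statistics import NormalDist

def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided P value for a difference in means between 2 equal-sized
    groups, assuming a common known standard deviation (z test)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical values chosen for illustration only
MEAN_DIFF = 0.5   # difference between groups, mm Hg
SD = 15.0         # common standard deviation, mm Hg

for n in (100, 10_000, 1_000_000):
    p = two_sample_z_p_value(MEAN_DIFF, SD, n)
    print(f"n per group = {n:>9,}: P = {p:.4g}")
```

The difference is the same in every row; only the sample size changes. A difference that is far from significant with 100 participants per group falls below the conventional .05 threshold with 10,000 per group, even though its clinical relevance is unchanged.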
When reporting the results of observational studies, authors should consider following the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines.2 The study design should be clearly described and be consistent with how the data were collected and analyzed, and the study results should be presented in a concise yet complete manner. The report should state that the study was performed after institutional review board approval or exemption was obtained. Authors should also describe whether any interim analyses were performed and whether there were any protocol violations. Limitations should be reported to promote scientific integrity and the validity of the conclusions, which should be fully supported by the data analysis. Interpretations of observational studies should lead only to descriptions of associations between variables, not to conclusions of causality.
Insufficient power might not seem to be a problem with large databases, but it can be. Study samples ...