The gold standard for comparing two or more therapies in modern medicine is the prospective, double-blind, randomized clinical trial. In this model, especially with sizable enrollment, randomization is assumed to equalize outcome-influencing factors across study arms such that they have no unbalanced effect on outcomes tested. Recently, clinical studies have increasingly utilized registries, electronic medical records, and other real-world observational data sets in place of or to supplement randomized trials. Notably, nonrandomized studies are subject to confounding when enrollees who receive one treatment under investigation differ systematically from those receiving another, including selection bias where patient treatments are chosen by their physicians rather than by randomization. In such studies, statistical adjustment by propensity-score matching (PSM) is commonly employed in an attempt to reduce bias from concomitant confounding variables (to correct for many baseline imbalances). PSM attempts to mimic randomization on observed covariates. PSM is also frequently used in post-hoc, retrospective, and subgroup analyses for similar reasons. Importantly, while PSM is valuable, it is not all-inclusive. It is not readily apparent that practitioners recognize this clinically important limitation and consider it when interpreting/applying study results. This paper provides supporting details regarding the above, and uses a comparison of two atrial fibrillation trials and two CHA2DS2-VASc circumstances to enhance the points made.
The “gold standard” for comparing two or more therapies in modern medicine is the prospective, double-blind, randomized clinical trial. In this model, especially with sizable enrollment, randomization is assumed to equalize outcome-influencing factors across study arms such that they have no unbalanced effect on outcomes tested. Recently, clinical studies have increasingly utilized registries, electronic medical records, and other “real-world” “observational” data sets in place of or to supplement randomized trials. Notably, nonrandomized studies are subject to confounding when enrollees who receive one treatment under investigation differ systematically from those receiving another, including selection bias where patient treatments are chosen by their physicians rather than by randomization. In such studies, statistical adjustment by propensity-score matching (PSM), first described in 1983, is commonly employed in an attempt to reduce bias from concomitant confounding variables (to “correct” for many baseline imbalances). PSM attempts to mimic randomization on observed covariates. PSM is also frequently used in post-hoc, retrospective, and subgroup analyses for similar reasons. Importantly, while PSM is valuable, it is not all-inclusive. It is not readily apparent that practitioners recognize this clinically important limitation and consider it when interpreting/applying study results. Sometimes this does not matter, but sometimes it does. Interestingly, a “report card on propensity-score matching in the cardiology literature” has found that the application of PSM in cardiology reports has been “poor”.
For this paper, I reviewed 20 randomly selected manuscripts in Circulation, JACC, and Stroke from the past 4 years. I also re-reviewed two older atrial fibrillation (AF) studies. In the 20 papers, the factors chosen for PSM were usually but not always listed. Some were specific to the type of trial, e.g., prior surgeries for surgical studies, but almost all included: age; gender; hypertension; diabetes; coronary artery disease history; heart failure or LVEF; non-ischemic heart diseases; AF +/- other rhythm detail; prior stroke; renal, hepatic, and pulmonary status; medication list; smoking; hyperlipidemia; and selected blood tests, ECG findings, and echocardiographic findings. Some included prior alcohol/drug abuse and weight. Notably, however, these variables were most often considered only as present/absent and were virtually never considered in terms of severity (quantitatively). Also rarely, if ever, considered were specific drugs within a class, drug doses, drug interaction potential, or past patient history, although any of these could significantly affect study results.
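To illustrate how matching on a present/absent covariate can leave severity unbalanced, the following minimal Python sketch may help. It is not taken from any of the reviewed papers: the ten-patient cohort, the `Patient` fields, and the function names are all hypothetical, and the propensity score is estimated simply as the treated fraction within each hypertension stratum rather than by the logistic regression usually used in practice. Matching is greedy 1:1 nearest-neighbor without replacement, with ties broken by list order.

```python
from collections import defaultdict
from typing import NamedTuple


class Patient(NamedTuple):
    treated: bool
    hypertension: bool  # recorded as present/absent and used for matching
    lvh: bool           # severity marker that is NOT entered into the model


# Hypothetical toy cohort (illustrative only, not real patient data).
COHORT = [
    Patient(True, True, True),
    Patient(True, True, True),
    Patient(True, True, False),
    Patient(True, False, False),
    Patient(False, True, False),
    Patient(False, True, False),
    Patient(False, True, False),
    Patient(False, True, True),
    Patient(False, False, False),
    Patient(False, False, False),
]


def estimate_propensity(cohort):
    """Estimate P(treated | hypertension) as the treated fraction per stratum."""
    counts = defaultdict(lambda: [0, 0])  # flag -> [n_treated, n_total]
    for p in cohort:
        counts[p.hypertension][0] += int(p.treated)
        counts[p.hypertension][1] += 1
    return {flag: n_treated / n for flag, (n_treated, n) in counts.items()}


def match_one_to_one(cohort):
    """Greedy nearest-neighbor matching on the score, without replacement."""
    score = estimate_propensity(cohort)
    available = [p for p in cohort if not p.treated]
    pairs = []
    for t in (p for p in cohort if p.treated):
        if not available:
            break
        best = min(
            available,
            key=lambda c: abs(score[c.hypertension] - score[t.hypertension]),
        )
        available.remove(best)
        pairs.append((t, best))
    return pairs


def prevalence(patients, field):
    """Fraction of patients for whom the named boolean field is True."""
    return sum(getattr(p, field) for p in patients) / len(patients)


pairs = match_one_to_one(COHORT)
matched_treated = [t for t, _ in pairs]
matched_controls = [c for _, c in pairs]

for field in ("hypertension", "lvh"):
    print(field, prevalence(matched_treated, field),
          prevalence(matched_controls, field))
```

In this toy cohort the matched arms have the same hypertension prevalence (0.75 in each), yet LVH is present in half of the matched treated patients and in none of the matched controls. The matching has balanced the recorded flag while leaving the unrecorded severity marker unbalanced.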
While each of the above-listed comorbidities is important to recognize and adjust for by PSM with respect to its effects on study results, additional potentially confounding and results-influencing factors may be present but go unnoted/unmentioned. For example, hypertensive patients may or may not have LV hypertrophy (LVH), but the presence/absence of LVH is rarely if ever considered. Hypertensives with LVH have a poorer survival, two- to four-fold greater cardiovascular (CV) morbidity, and a greater likelihood of developing AF despite antihypertensive treatment. Thus, LVH may affect CV outcomes. Moreover, LVH resolution potential with antihypertensive drugs differs among the drug classes. Similarly, other factors, such as specific medications (beyond drug class), their dosing, or their possible drug interactions, were only considered once in the 20 papers I reviewed. Additionally, none considered responses to prior drug trials or the duration of the comorbidities present. Consider: (1) In many trials, high-dose statins have proven to be superior to lower doses in reducing major adverse cardiovascular outcomes. Yet, statin doses were not part of any propensity matching consideration that I examined. Moreover, not all statins are the same with respect to possible drug interactions. (2) Similarly, not all beta blockers are identical. Hepatically metabolized beta blockers can have up to 10-fold differences in serum concentrations and actions for a given dose, which is not the case for renally excreted beta blockers. Some have effects beyond beta receptor blockade. Some have demonstrated superiority in heart failure. Thus, simply noting beta blockers as present or absent should be clinically insufficient. (3) Likewise, it is well recognized that specific agents for diabetic management can have dramatically different effects on CV outcomes, and noting diabetes as present/absent without considering specific treatment(s) may be shortsighted.
(4) The same is true for ACE inhibitors/ARBs, where outcomes across trials have not been uniform and where tissue penetrance and effects therefrom differ among agents, with differences in clinical outcomes. (5) Finally, disease duration and responses to prior therapy can dramatically alter treatment responses, but they are almost never considered with PSM. Here, the two older AF trials are particularly instructive. In the prospective, randomized, placebo-controlled sustained-release propafenone vs placebo AF trials, RAFT and ERAFT, lower efficacy rates were seen with the active drug in ERAFT vs RAFT despite using identical study drug, dose, placebo, and manufacturer for treatment of the same arrhythmia. Importantly, ERAFT had greater AF burden, longer AF history, and more prior antiarrhythmic drug (AAD) failures, and disease severity and prior AAD failure both generally predict lower response to subsequent AAD administration. Simply comparing these two populations based on the presence of prior AF and on a list of specific underlying diseases and comorbidities would have missed these important result-altering details. Finally, consider that even the CHA2DS2-VASc score, which has been included in PSM in many trials, can be a misleading comparator for both stroke and mortality. In AF, both older age and prior stroke carry a greater risk for ensuing stroke than the other CHA2DS2-VASc score components, and age is the single strongest mortality predictor. Thus, a 69-year-old female diabetic hypertensive s/p an MI likely has a lower absolute risk of both stroke and death than an 89-year-old male with a prior stroke and prior MI, although both patients’ scores equal 5. However, her risk would be higher if her hypertension had associated LVH and renal insufficiency.
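The arithmetic behind these two example patients can be checked with a short Python sketch. The point weights follow the standard CHA2DS2-VASc definition (1 point each for heart failure, hypertension, diabetes, vascular disease, age 65 to 74, and female sex; 2 points each for age 75 or older and prior stroke/TIA/thromboembolism); the function and parameter names are illustrative assumptions, and a prior MI is counted as vascular disease.

```python
def cha2ds2_vasc(
    age: int,
    female: bool,
    heart_failure: bool = False,
    hypertension: bool = False,
    diabetes: bool = False,
    prior_stroke_tia: bool = False,
    vascular_disease: bool = False,
) -> int:
    """Return the CHA2DS2-VASc score from its standard components."""
    score = 0
    score += 1 if heart_failure else 0
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 2 if prior_stroke_tia else 0
    score += 1 if vascular_disease else 0  # e.g., prior MI
    score += 1 if female else 0
    # Age contributes 2 points at 75 or older, 1 point at 65 to 74.
    if age >= 75:
        score += 2
    elif age >= 65:
        score += 1
    return score


# 69-year-old female, diabetic, hypertensive, prior MI.
woman = cha2ds2_vasc(age=69, female=True, hypertension=True,
                     diabetes=True, vascular_disease=True)

# 89-year-old male, prior stroke, prior MI.
man = cha2ds2_vasc(age=89, female=False, prior_stroke_tia=True,
                   vascular_disease=True)

print(woman, man)
```

Both calls return 5, although the components that produce the score differ: four 1-point items plus sex for the woman, versus the two 2-point items (age 75 or older and prior stroke) plus prior MI for the man. Matching on the total alone would treat these two patients as equivalent.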
In my opinion, propensity matching corrections should be considered valuable, but not clinically complete. Many papers recognize this and typically include statements in their limitations section such as: (a) “Observational studies often do not account for confounders, and the use of unadjusted values from these studies introduces bias”; (b) “Although adjustment was made for several variables, it is possible that residual confounders between the groups could have been omitted in the analysis”; (c) “Like any nonrandomized design, propensity matching may not be able to balance unmeasured confounders”. Perhaps this has been best stated by Moss et al.: “Propensity scoring is a powerful tool that enables excellent matching of baseline characteristics, which may be superior to that obtained in a randomized trial. However, if important unobserved covariables are not identified and not entered into the propensity model, significant baseline differences may still exist between the two groups. Propensity scoring is not therefore a substitute for randomization.”
Although many such confounders cannot be easily quantitated and included in a propensity score, I suggest that at least those that could be relevant to the results of the study being reported be recognized and listed, not just called residual confounders. In this way the reader will know what the investigators’ PSM did not include and can reflect on their relevance, possible impacts, and the best application of the study results to his/her patients. Is it not reasonable to suggest that such an effort be made so as to further enhance the link between a clinical study and clinical practice?